The ANSI/IEEE Standard 854-1987 for floating-point arithmetic is interpreted by
converting the lexical descriptions in the standard into mathematical conditional
descriptions organized in tables. The standard is represented in higher-order
logic within the framework of the HOL system. The paper is divided into two parts:
the first part gives the interpretation and the second part the description in HOL.


The objective of this work is to provide a representation of the IEEE-854[6] standard in a formal logic 
system against which implementations can be verified using deductive reasoning. Before the standard 
could be represented in the formal system it was necessary to extract the meaning of the standard for 
numerous conditions and cases. Hence, the interpretation of the standard became part of the effort. Paul
Miner provided valuable discussions to aid in the interpretation of the standard and worked on a similar
effort to specify the IEEE-854 standard in the PVS system[7].

Previous efforts to represent a floating-point arithmetic standard in a formal language include the
partial formalization of the ANSI/IEEE-754-1985[5] standard in the Z language by Geoff Barret[1] and in the
HOL system by Jing Pang[8]. IEEE-854 is a generalization of IEEE-754. IEEE-854 does not specify
encoding formats for floating-point numbers and permits the representation of floating-point numbers in
the binary and decimal systems.

The interpretation of the standard is not intended to replace the standard but rather aid in its under- 
standing. Although the standard has been reviewed meticulously to get a full understanding of its meaning, 
errors probably exist in the interpretation. Any discrepancies between the interpretation and the standard 
should be considered an interpretation error and the standard should take precedence. 


Part 1: Interpretation 

1 Introduction 

This part of the paper covers the interpretation of ANSI/IEEE Standard 854-1987. The interpretation 
consists of the definition of 29 tables which address the cases found during floating-point rounding and 
arithmetic operations. Operations on infinity, zero, and symbolic entries, as well as exceptions and traps 
are incorporated directly in the definition of each operation rather than in separate sections. 

2 Floating-point Numbers and Precisions 

This section contains a brief definition of floating-point numbers and floating-point precisions. A
floating-point number is a digit string characterized by three components: a sign digit, a signed exponent,
and a significand. A floating-point number can have three meanings: 1. a value; 2. an infinity; and 3. not a
number (NaN). Values, infinities, and NaNs are further divided into classes. A value can be a normal number,
a subnormal number, or zero. An infinity can be positive or negative. A NaN can be signaling or quiet. A
subnormal number is a nonzero number whose magnitude is less than the base raised to the
precision's minimum exponent. IEEE-854 defines four precisions: single, double, single extended, and double
extended. Each precision is defined by the following parameters:

$b$          the radix or base
$p$          the number of base-$b$ digits in the significand
$E_{max}$    the maximum exponent
$E_{min}$    the minimum exponent

For all precisions, the parameters are subjected to the following constraints: 

$b$ shall be either 2 or 10 and shall be the same for all supported precisions
$(E_{max} - E_{min}) / p$ shall exceed 5 and should exceed 10
$b^{p-1} \geq 10^5$

Additional constraints are imposed on the parameters for double, single
extended, and double extended precisions. For double precision:

$b^{p_d} \geq 10\, b^{2 p_s}$
$E_{max_d} \geq 8 E_{max_s} + 7$
$E_{min_d} \leq 8 E_{min_s}$

where the subscripts d and s denote double and single precisions respectively. For extended precision, the 
following constraints must hold over the base precision: 


$E_{max_e} \geq 8 E_{max_b} + 7$
$E_{min_e} \leq 8 E_{min_b}$
$p_e \geq 1.2\, p_b$
for $b = 2$: $p_e \geq p_b + \lceil \log_2 (E_{max_b} - E_{min_b}) \rceil$

where the subscripts e and b denote the extended and base precisions.
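
The parameter constraints lend themselves to a mechanical check. The following is a minimal Python sketch, not part of the standard or of the HOL formalization. The class `Precision` and the function names are hypothetical, the double-precision inequalities are taken to be $b^{p_d} \geq 10\, b^{2 p_s}$, $E_{max_d} \geq 8 E_{max_s} + 7$, and $E_{min_d} \leq 8 E_{min_s}$, and the sample parameter values are those of the IEEE-754 binary single and double formats, used here only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Precision:
    """Hypothetical container for the four parameters of a precision."""
    b: int      # radix or base
    p: int      # number of base-b digits in the significand
    e_max: int  # maximum exponent
    e_min: int  # minimum exponent


def valid_precision(fmt: Precision) -> bool:
    """Constraints that apply to every precision."""
    return (
        fmt.b in (2, 10)
        and (fmt.e_max - fmt.e_min) > 5 * fmt.p   # (E_max - E_min) / p > 5
        and fmt.b ** (fmt.p - 1) >= 10 ** 5       # b^(p-1) >= 10^5
    )


def valid_double(single: Precision, double: Precision) -> bool:
    """Additional constraints relating double to single precision."""
    return (
        single.b == double.b
        and double.b ** double.p >= 10 * single.b ** (2 * single.p)
        and double.e_max >= 8 * single.e_max + 7
        and double.e_min <= 8 * single.e_min
    )


# Illustrative values (IEEE-754 binary formats), not taken from the text.
SINGLE = Precision(b=2, p=24, e_max=127, e_min=-126)
DOUBLE = Precision(b=2, p=53, e_max=1023, e_min=-1022)
```

Integer arithmetic is used throughout so that the comparisons are exact even for large exponents.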

Thus, each precision allows the representation of just the following 
entities: 


1. Numbers of the form $(-1)^s\, b^E\, (d_0 . d_1 d_2 \ldots d_{p-1})$ where

$s$ = a natural number defining the algebraic sign
$E$ = any integer between $E_{min}$ and $E_{max}$, inclusive
$d_i$ = a base-$b$ digit ($0 \leq d_i \leq b - 1$)

2. Two infinities, $+\infty$ and $-\infty$

3. At least one signaling NaN 

4. At least one quiet NaN 
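
As an illustration of the first kind of entity, the value of a finite number can be computed exactly from its sign, exponent, and digit string. This is a small Python sketch under stated assumptions: the function name `fp_value` and its argument layout are hypothetical, and the digits are passed as the list $d_0, d_1, \ldots, d_{p-1}$.

```python
from fractions import Fraction


def fp_value(s, e, digits, b):
    """Exact value of (-1)^s * b^e * (d0 . d1 d2 ... d_{p-1}).

    `digits` holds d0, d1, ..., d_{p-1}; digit d_i has weight b^(-i).
    """
    if any(not 0 <= d < b for d in digits):
        raise ValueError("each digit must satisfy 0 <= d <= b - 1")
    significand = sum(Fraction(d, b ** i) for i, d in enumerate(digits))
    return (-1) ** s * Fraction(b) ** e * significand
```

For example, `fp_value(0, 0, [1, 0, 1], 2)` is the binary number 1.01, that is 5/4. A leading digit of zero yields the value of a subnormal number when `e` is the minimum exponent.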

3 Exceptions and Traps

Operations on floating-point numbers, defined in succeeding sections, can signal exceptions as a result 
of performing the operation. The generation of exceptions will depend on operands, results, and operation 
conditions. Five exceptions are signaled when detected: 

Invalid operation 

Division by zero 

Overflow 

Underflow 

Inexact 

An exception will set a status flag and, if enabled by the user, will invoke an exception handling trap. If 
exception handlers are implemented then each exception should have a user controlled trap associated with 
it. 

The resulting value on some operations will depend on whether an exception is detected and a trap is 
enabled. Conditions which will result in exceptions will be defined within the operation’s definition. 


4 Rounding 

Floating-point numbers are intended to be a finite approximation of the real numbers. Rounding is 
defined in the IEEE-854 standard thus, 

Rounding takes a number regarded as infinitely precise and, if necessary, modifies it to fit the
destination’s precision while signaling the inexact exception (see 7.5). [6, section 4, page 9]

Four rounding modes are specified in the standard:

An implementation of this standard shall provide round to nearest as the default rounding mode. [...]
An implementation of this standard shall also provide three user-selectable directed rounding modes:
round towards $+\infty$, round towards $-\infty$, and round towards 0. [6, section 4.1, page 9]

In addition, depending on the magnitude of the number to be rounded and the rounding mode, rounding
can produce infinite floating-point numbers while signaling other exceptions:

The rounding modes may affect the signs of zero sums (see 6.3), and do affect the threshold beyond
which overflow (see 7.3) and underflow (see 7.4) may be signaled. [6, section 4, page 9]

Round to near returns the floating-point number with value nearest to the infinitely precise number. If
two floating-point values are equally near, round to near returns the one whose least significant digit is
even. Round to positive infinity returns the floating-point number with the smallest value that is not less
than the infinitely precise number. Round to negative infinity returns the floating-point number with the
greatest value that is not greater than the infinitely precise number. Round to zero returns the floating-point
number with the largest magnitude that is not greater than the magnitude of the infinitely precise number.
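
The four rounding modes can be illustrated on a simplified model. The Python sketch below rounds an exact rational number to the nearest multiple of a fixed spacing `ulp`, which corresponds to rounding within a single exponent range; overflow, underflow, and exponent changes are deliberately not modeled. The function name `round_to_grid` and the mode names as strings are assumptions made for the example.

```python
import math
from fractions import Fraction


def round_to_grid(r, ulp, mode):
    """Round exact value r to a multiple of ulp; return (result, inexact)."""
    r, ulp = Fraction(r), Fraction(ulp)
    q = r / ulp
    if mode == "near":
        n = round(q)        # Python rounds exact ties to the even integer
    elif mode == "pos_inf":
        n = math.ceil(q)    # smallest multiple not less than r
    elif mode == "neg_inf":
        n = math.floor(q)   # greatest multiple not greater than r
    elif mode == "zero":
        n = math.trunc(q)   # largest magnitude not greater than |r|
    else:
        raise ValueError("unknown rounding mode: " + mode)
    result = n * ulp
    return result, result != r
```

The second component of the returned pair plays the role of the inexact exception: it is true exactly when the rounded result differs from the infinitely precise number.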

The following tables summarize the interpretation of the standard for all value ranges of real numbers
to be rounded. $G_{neg}$, $G_{pos}$, $L_{neg}$, and $L_{pos}$ represent, respectively, the greatest negative,
greatest positive, least negative, and least positive finite floating-point numbers representable in a given
precision, where greatest and least refer to magnitude. The tables give the result and exceptions, if any, of
the rounding operation for a given infinitely precise number and rounding mode. Three exceptions can be
signaled by the rounding operation: underflow, overflow, and inexact.

When the overflow trap handler is implemented and enabled, overflow detection delivers to the trap
handler the infinitely precise result of the operation divided by $b^a$ and then rounded. The exponent
adjustment $a$ is chosen to be approximately $3 (E_{max} - E_{min}) / 4$ and should be divisible by twelve.
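
One way to pick an exponent adjustment that satisfies both conditions is to take the multiple of twelve nearest to $3 (E_{max} - E_{min}) / 4$. The Python sketch below does this; choosing the nearest multiple is an assumption of the example, since the text only asks for an approximate value, and the function name is hypothetical.

```python
from fractions import Fraction


def exponent_adjustment(e_max, e_min):
    """Multiple of 12 nearest to 3 * (e_max - e_min) / 4."""
    target = Fraction(3 * (e_max - e_min), 4)
    return 12 * round(target / 12)
```

With the IEEE-754 binary parameters as sample inputs, `exponent_adjustment(127, -126)` gives 192 and `exponent_adjustment(1023, -1022)` gives 1536, which are the adjustments IEEE-754 prescribes for its single and double formats.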


Table 1: Negative numbers less than $G_{neg}$

| mode | overflow trap | (range 1) | (range 2) | (range 3) |
|---|---|---|---|---|
| near | disabled | -inf, overflow, inexact | -inf, overflow, inexact | $G_{neg}$, inexact |
| near | enabled | round($r/b^a$), overflow, inexact | round($r/b^a$), overflow, inexact | $G_{neg}$, inexact |
| pos_inf | disabled | $G_{neg}$, overflow, inexact | $G_{neg}$, inexact | $G_{neg}$, inexact |
| pos_inf | enabled | round($r/b^a$), overflow, inexact | $G_{neg}$, inexact | $G_{neg}$, inexact |
| neg_inf | disabled | -inf, overflow, inexact | -inf, overflow, inexact | -inf, overflow, inexact |
| neg_inf | enabled | round($r/b^a$), overflow, inexact | round($r/b^a$), overflow, inexact | round($r/b^a$), overflow, inexact |
| zero | disabled | $G_{neg}$, overflow, inexact | $G_{neg}$, inexact | $G_{neg}$, inexact |
| zero | enabled | round($r/b^a$), overflow, inexact | $G_{neg}$, inexact | $G_{neg}$, inexact |

When the value resulting from a rounding operation is not equal to the infinitely precise number,
the inexact exception is signaled. The inexact flag is represented by exc1 and is defined in
Table 10.

Table 2: Negative numbers greater than or equal to $G_{neg}$ and less than or equal to $-b^{E_{min}}$

| mode | $G_{neg} \leq r \leq -b^{E_{min}}$ |
|---|---|
| all modes | normal, exc1 |


When the underflow trap handler is implemented and enabled, underflow detection delivers to the
trap handler the infinitely precise result of the operation multiplied by $b^a$ and then rounded. The exponent
adjustment $a$ is the same as used for overflow. exc2 is the underflow exception flag defined in Table 11.
Underflow detection depends on the rounding result, the detection scheme selected by the user, and whether
the trap is enabled.


Table 3: Negative numbers greater than $-b^{E_{min}}$ and less than or equal to $L_{neg}$

| mode | underflow trap | $-b^{E_{min}} < r \leq -b^{E_{min}} - \frac{1}{2} L_{neg}$ | $-b^{E_{min}} - \frac{1}{2} L_{neg} < r < -b^{E_{min}} - L_{neg}$ | $-b^{E_{min}} - L_{neg} \leq r \leq L_{neg}$ |
|---|---|---|---|---|
| near | disabled or (enabled, no underflow) | $-b^{E_{min}}$, inexact, exc2 | $-b^{E_{min}} - L_{neg}$, inexact, exc2 | denormal, exc1, exc2 |
| near | enabled and underflow detected | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow |
| pos_inf | disabled or (enabled, no underflow) | $-b^{E_{min}} - L_{neg}$, inexact, exc2 | $-b^{E_{min}} - L_{neg}$, inexact, exc2 | denormal, exc1, exc2 |
| pos_inf | enabled and underflow detected | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow |
| neg_inf | disabled or (enabled, no underflow) | $-b^{E_{min}}$, inexact, exc2 | $-b^{E_{min}}$, inexact, exc2 | denormal, exc1, exc2 |
| neg_inf | enabled and underflow detected | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow |
| zero | disabled or (enabled, no underflow) | $-b^{E_{min}} - L_{neg}$, inexact, exc2 | $-b^{E_{min}} - L_{neg}$, inexact, exc2 | denormal, exc1, exc2 |
| zero | enabled and underflow detected | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow |


Table 4: Negative numbers greater than $L_{neg}$

| mode | underflow trap | $L_{neg} < r < \frac{1}{2} L_{neg}$ | $\frac{1}{2} L_{neg} \leq r < 0$ |
|---|---|---|---|
| near | disabled | $L_{neg}$, inexact, underflow | -0, inexact, underflow |
| near | enabled | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow |
| pos_inf | disabled | -0, inexact, underflow | -0, inexact, underflow |
| pos_inf | enabled | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow |
| neg_inf | disabled | $L_{neg}$, inexact, underflow | $L_{neg}$, inexact, underflow |
| neg_inf | enabled | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow |
| zero | disabled | -0, inexact, underflow | -0, inexact, underflow |
| zero | enabled | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow |


Table 5: Zero

| mode | (-0) + (-0) | (+0) + (+0) | sign(fp1) ≠ sign(fp2), v(fp1)+v(fp2) = 0 | +0 × fp | -0 × fp |
|---|---|---|---|---|---|
| near | -0 | +0 | +0 | sign(fp)0 | sign(-fp)0 |
| pos_inf | -0 | +0 | +0 | sign(fp)0 | sign(-fp)0 |
| neg_inf | -0 | +0 | -0 | sign(fp)0 | sign(-fp)0 |
| zero | -0 | +0 | +0 | sign(fp)0 | sign(-fp)0 |

sign(fp) is the algebraic sign of the floating-point number fp. v(fp) denotes the value of the finite
floating-point number fp and v(fp1)+v(fp2) is the infinitely precise addition of the values of fp1 and fp2.
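
The sign rules of Table 5 are small enough to state as code. The following Python sketch encodes a sign as 0 for plus and 1 for minus; the function names and the string mode names are assumptions of the example.

```python
def zero_sum_sign(sign1, sign2, mode):
    """Sign (0 for +, 1 for -) of a sum whose exact value is zero.

    Covers (-0)+(-0), (+0)+(+0), and operands of opposite sign that cancel.
    """
    if sign1 == sign2:
        return sign1                      # like-signed zeros keep their sign
    return 1 if mode == "neg_inf" else 0  # opposite signs: -0 only for neg_inf


def zero_product_sign(sign_zero, sign_fp):
    """Sign of (zero with sign_zero) x fp: the exclusive or of the two signs."""
    return sign_zero ^ sign_fp
```

For instance, `zero_sum_sign(0, 1, "neg_inf")` returns 1, matching the single -0 entry in the third column of the table.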


Table 6: Positive numbers less than $L_{pos}$

| mode | underflow trap | $0 < r \leq \frac{1}{2} L_{pos}$ | $\frac{1}{2} L_{pos} < r < L_{pos}$ |
|---|---|---|---|
| near | disabled | +0, inexact, underflow | $L_{pos}$, inexact, underflow |
| near | enabled | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow |
| pos_inf | disabled | $L_{pos}$, inexact, underflow | $L_{pos}$, inexact, underflow |
| pos_inf | enabled | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow |
| neg_inf | disabled | +0, inexact, underflow | +0, inexact, underflow |
| neg_inf | enabled | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow |
| zero | disabled | +0, inexact, underflow | +0, inexact, underflow |
| zero | enabled | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow |


Table 7: Positive numbers greater than or equal to $L_{pos}$ and less than $b^{E_{min}}$

| mode | underflow trap | $L_{pos} \leq r \leq b^{E_{min}} - L_{pos}$ | $b^{E_{min}} - L_{pos} < r < b^{E_{min}} - \frac{1}{2} L_{pos}$ | $b^{E_{min}} - \frac{1}{2} L_{pos} \leq r < b^{E_{min}}$ |
|---|---|---|---|---|
| near | disabled or (enabled, no underflow) | denormal, exc1, exc2 | $b^{E_{min}} - L_{pos}$, exc1, exc2 | $b^{E_{min}}$, exc1, exc2 |
| near | enabled and underflow detected | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow |
| pos_inf | disabled or (enabled, no underflow) | denormal, exc1, exc2 | $b^{E_{min}}$, exc1, exc2 | $b^{E_{min}}$, exc1, exc2 |
| pos_inf | enabled and underflow detected | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow |
| neg_inf | disabled or (enabled, no underflow) | denormal, exc1, exc2 | $b^{E_{min}} - L_{pos}$, exc1, exc2 | $b^{E_{min}} - L_{pos}$, exc1, exc2 |
| neg_inf | enabled and underflow detected | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow |
| zero | disabled or (enabled, no underflow) | denormal, exc1, exc2 | $b^{E_{min}} - L_{pos}$, exc1, exc2 | $b^{E_{min}} - L_{pos}$, exc1, exc2 |
| zero | enabled and underflow detected | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow | round($r \times b^a$), inexact, underflow |


Table 8: Positive numbers greater than or equal to $b^{E_{min}}$ and less than or equal to $G_{pos}$

| mode | $b^{E_{min}} \leq r \leq G_{pos}$ |
|---|---|
| all modes | normal, exc1 |


When the overflow trap handler is implemented and enabled, overflow detection delivers to the trap
handler the infinitely precise result divided by $b^a$ and then rounded. The exponent adjustment $a$ is as
defined earlier in this section.


Table 9: Positive numbers greater than $G_{pos}$

| mode | overflow trap | (range 1) | (range 2) | (range 3) |
|---|---|---|---|---|
| near | disabled | $G_{pos}$, inexact | +inf, overflow, inexact | +inf, overflow, inexact |
| near | enabled | $G_{pos}$, inexact | round($r/b^a$), overflow, inexact | round($r/b^a$), overflow, inexact |
| pos_inf | disabled | +inf, overflow, inexact | +inf, overflow, inexact | +inf, overflow, inexact |
| pos_inf | enabled | round($r/b^a$), overflow, inexact | round($r/b^a$), overflow, inexact | round($r/b^a$), overflow, inexact |
| neg_inf | disabled | $G_{pos}$, inexact | $G_{pos}$, inexact | $G_{pos}$, overflow, inexact |
| neg_inf | enabled | $G_{pos}$, inexact | $G_{pos}$, inexact | round($r/b^a$), overflow, inexact |
| zero | disabled | $G_{pos}$, inexact | $G_{pos}$, inexact | $G_{pos}$, overflow, inexact |
| zero | enabled | $G_{pos}$, inexact | $G_{pos}$, inexact | round($r/b^a$), overflow, inexact |


Exception exc1 depends on whether or not the rounded result is equal to the infinitely precise number.
exc1 is inexact if r and (round r) are not equal, and no exception if they are equal.


Table 10: exc1 inexact exception flag

| mode | (round r) = r | (round r) ≠ r |
|---|---|---|
| all | inexact = false | inexact = true |


The underflow exception, exc2, depends on the detection of tininess, loss of accuracy, and whether the
underflow trap is enabled or disabled. When the underflow trap is disabled, both tininess and loss of
accuracy must be detected to signal underflow. When the underflow trap is enabled, detection of tininess
alone results in an underflow flag.


Table 11: exc2 underflow exception flag

| mode | underflow trap disabled | underflow trap enabled |
|---|---|---|
| all | tiny $\wedge$ loss_acc = underflow | tiny = underflow |


Detection of tininess and loss of accuracy is user selectable. Tininess can be detected before 
or after rounding. 
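
The underflow flag of Table 11 and the before-rounding tininess test of Table 12 can be written out directly. The following Python sketch uses exact rationals for the infinitely precise number; the function names are hypothetical, and only the before-rounding option is shown.

```python
from fractions import Fraction


def tiny_before_rounding(r, b, e_min):
    """Table 12: tininess judged on the infinitely precise value r."""
    return 0 < abs(Fraction(r)) < Fraction(b) ** e_min


def underflow_flag(tiny, loss_acc, trap_enabled):
    """Table 11: the flag needs loss of accuracy only when the trap is disabled."""
    if trap_enabled:
        return tiny
    return tiny and loss_acc
```

With `b = 2` and `e_min = -2`, the value 3/16 is tiny because it lies strictly between 0 and 1/4, while 1/4 itself is not.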

Table 12: Tininess detection before rounding

| mode | $0 < \lvert r \rvert < b^{E_{min}}$ | $b^{E_{min}} \leq \lvert r \rvert$ |
|---|---|---|
| all | true | false |


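Table 13: Tininess detection after rounding

| mode | $-b^{E_{min}} < r \leq -b^{E_{min}} - \frac{1}{2} L_{neg}$ | $-b^{E_{min}} - \frac{1}{2} L_{neg} < r < b^{E_{min}} - \frac{1}{2} L_{pos}$ | $b^{E_{min}} - \frac{1}{2} L_{pos} \leq r < b^{E_{min}}$ | $b^{E_{min}} \leq \lvert r \rvert$ |
|---|---|---|---|---|
| near | false | true | false | false |
| pos_inf | true | true | false | false |
| neg_inf | false | true | true | false |
| zero | true | true | true | false |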


Loss of accuracy can be detected by denormalization loss or by inexact. Detection of denormalization
loss is defined in IEEE-854 as follows:


A denormalization loss: When the delivered result differs from what would have been computed were
the exponent range unbounded. [6, section 7.4, page 15]

An unbounded exponent range gives p digits of accuracy regardless of the number’s magnitude. Consider for
example the number

$r = b^{E_{min}} \times 0.00\ldots0d_{p-1}00\ldots0d_{2p-1}$

where each of the digit groups $0.00\ldots0d_{p-1}$ and $00\ldots0d_{2p-1}$ contains $p$ digits, and

$d_{p-1} = \alpha \neq 0$ and $d_{2p-1} = \beta \neq 0$

This number, when rounded to near with the exponent range unbounded, will result in:

$b^{E_{min} - (p-1)} \times d_0.00\ldots0d_{p-1}$

where

$d_0 = \alpha$ and $d_{p-1} = \beta$

Rounding to near with the exponent range bounded will give:

$b^{E_{min}} \times 0.00\ldots00d_{p-1}$

where

$d_{p-1} = \alpha$

The loss of accuracy due to rounding with the exponent range bounded is
$|r - (round\ r)| = \beta \times b^{E_{min} - (2p-1)}$


In general, when $|r - (round\ r)| > \frac{1}{2} b^{\lfloor \log_b |r| \rfloor + (1-p)}$ (for rounding to near) the
delivered result with the exponent range bounded will differ from the result with the exponent range
unbounded and loss of accuracy shall be detected. Other rounding modes have different detection thresholds
as given in the next table.


Table 14: Loss of accuracy detection by denormalization loss 


mode 

E E • + (1-p) 

7 mtm m^mtm ' " ' 

-b <rs-b 

o < IH 

■ | E mt . + (!-/») 
1 r\<b 

b E ~"-'\ r < h e ~ 

b Emim s |r| 

near 

1 r - ( round r) | > ( ifcL lo ** r J + ( 1 P) ) 

true 


false 

posjnf 


true 

( (round r) -r) >0 

false 

neg_inf 

(r- (round r) ) > 0 

true 

(,- (rom,d 

false 

zero 

((roundr)-r) £ (i, L ‘ 0|! *' J * <I -' ) ) 

true 

(r- (round r)) (1_jP) ) 

false 


Table 15: Loss of accuracy detection by inexact

| mode | $\forall r$ |
|---|---|
| all | (round r) ≠ r |


5 Operations 

Implementations conforming to the IEEE-854 standard must provide the add, subtract, multiply, 
divide, square root, remainder, round to floating point integer, conversion between precisions, conversion 
between floating point and integer numbers, conversions between floating point numbers and decimal 
strings, and compare operations. The arithmetic operations are shown in tabular form for all floating-point 
arguments. 

5.1 Arithmetic

Table 16: Floating-point addition (fp1 add fp2)

| fp2 \ fp1 | sig NaN | quiet NaN | -0 | +0 | finite ≠ 0 | -inf | +inf |
|---|---|---|---|---|---|---|---|
| sig NaN | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid |
| quiet NaN | quiet NaN, invalid | fp1 ∨ fp2 | fp2 | fp2 | fp2 | fp2 | fp2 |
| -0 | quiet NaN, invalid | fp1 | -0 | round(0) | fp1 | -inf | +inf |
| +0 | quiet NaN, invalid | fp1 | round(0) | +0 | fp1 | -inf | +inf |
| finite ≠ 0 | quiet NaN, invalid | fp1 | fp2 | fp2 | round(v(fp1)+v(fp2)) | -inf | +inf |
| -inf | quiet NaN, invalid | fp1 | -inf | -inf | -inf | -inf | quiet NaN, invalid |
| +inf | quiet NaN, invalid | fp1 | +inf | +inf | +inf | quiet NaN, invalid | +inf |


v(fp) denotes the value of the finite floating-point number fp and v(fp1)+v(fp2) is the infinitely precise
addition of the values of fp1 and fp2.

Floating-point subtraction is defined in terms of floating-point addition. The unary negation operation
changes the algebraic sign of a floating-point number by changing its sign digit. The negation
operation changes the sign of finite numbers and infinities and leaves NaNs unchanged.


Table 17: Floating-point subtraction

| fp1 sub fp2 | fp1, fp2 |
|---|---|
| result | fp1 add (-fp2) |


Table 18: Floating-point multiplication (fp1 mul fp2)

| fp2 \ fp1 | sig NaN | quiet NaN | -0 | +0 | finite ≠ 0 | -inf | +inf |
|---|---|---|---|---|---|---|---|
| sig NaN | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid |
| quiet NaN | quiet NaN, invalid | fp1 ∨ fp2 | fp2 | fp2 | fp2 | fp2 | fp2 |
| -0 | quiet NaN, invalid | fp1 | +0 | -0 | sign(-fp1)0 | quiet NaN, invalid | quiet NaN, invalid |
| +0 | quiet NaN, invalid | fp1 | -0 | +0 | sign(fp1)0 | quiet NaN, invalid | quiet NaN, invalid |
| finite ≠ 0 | quiet NaN, invalid | fp1 | sign(-fp2)0 | sign(fp2)0 | round(v(fp1) × v(fp2)) | sign(-fp2)inf | sign(fp2)inf |
| -inf | quiet NaN, invalid | fp1 | quiet NaN, invalid | quiet NaN, invalid | sign(-fp1)inf | +inf | -inf |
| +inf | quiet NaN, invalid | fp1 | quiet NaN, invalid | quiet NaN, invalid | sign(fp1)inf | -inf | +inf |


Table 19: Floating-point division (fp1 div fp2)

| fp2 \ fp1 | sig NaN | quiet NaN | -0 | +0 | finite ≠ 0 | -inf | +inf |
|---|---|---|---|---|---|---|---|
| sig NaN | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid |
| quiet NaN | quiet NaN, invalid | fp1 ∨ fp2 | fp2 | fp2 | fp2 | fp2 | fp2 |
| -0 | quiet NaN, invalid | fp1 | quiet NaN, invalid | quiet NaN, invalid | sign(-fp1)inf, div_zero | +inf, div_zero | -inf, div_zero |
| +0 | quiet NaN, invalid | fp1 | quiet NaN, invalid | quiet NaN, invalid | sign(fp1)inf, div_zero | -inf, div_zero | +inf, div_zero |
| finite ≠ 0 | quiet NaN, invalid | fp1 | sign(-fp2)0 | sign(fp2)0 | round(v(fp1) / v(fp2)) | sign(-fp2)inf | sign(fp2)inf |
| -inf | quiet NaN, invalid | fp1 | +0 | -0 | sign(-fp1)0 | quiet NaN, invalid | quiet NaN, invalid |
| +inf | quiet NaN, invalid | fp1 | -0 | +0 | sign(fp1)0 | quiet NaN, invalid | quiet NaN, invalid |


The remainder operation x REM y is defined by x - (y × n) for non-zero values of y, where n is the
integer nearest to x/y.


Table 20: Floating-point remainder (fp1 REM fp2)

| fp2 \ fp1 | sig NaN | quiet NaN | -0 | +0 | finite ≠ 0 | -inf | +inf |
|---|---|---|---|---|---|---|---|
| sig NaN | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid |
| quiet NaN | quiet NaN, invalid | fp1 ∨ fp2 | fp2 | fp2 | fp2 | fp2 | fp2 |
| -0 | quiet NaN, invalid | fp1 | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid |
| +0 | quiet NaN, invalid | fp1 | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid | quiet NaN, invalid |
| finite ≠ 0 | quiet NaN, invalid | fp1 | -0 | +0 | v(fp1) - (v(fp2) × n), n = nearest integer(v(fp1)/v(fp2)) | quiet NaN, invalid | quiet NaN, invalid |
| -inf | quiet NaN, invalid | fp1 | fp1 | fp1 | fp1 | quiet NaN, invalid | quiet NaN, invalid |
| +inf | quiet NaN, invalid | fp1 | fp1 | fp1 | fp1 | quiet NaN, invalid | quiet NaN, invalid |


When two possible values of n are equally near to x/y then n is even. If x - (y × n) is zero then
x REM y is +0 for positive x and -0 for negative x regardless of the rounding mode. Infinite
arithmetic is defined in IEEE-854 as the limiting case of real arithmetic. To define the remainder function when
x is finite and y is infinite, the limit is calculated as $\lim_{y \to \infty} (x - (y \times n)) = x$
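
For finite operands the remainder can be computed exactly with rational arithmetic. The Python sketch below is an illustration only: the function name `ieee_rem` is hypothetical, NaNs and infinities are not handled, and the sign of a zero result is not modeled because exact rationals have no signed zero.

```python
from fractions import Fraction


def ieee_rem(x, y):
    """x REM y = x - y * n, where n is the integer nearest to x / y.

    When x / y lies exactly halfway between two integers, n is the even one.
    """
    x, y = Fraction(x), Fraction(y)
    if y == 0:
        raise ValueError("invalid operation: y is zero")
    n = round(x / y)  # rounding a Fraction sends exact ties to the even integer
    return x - y * n
```

For example, `ieee_rem(5, 2)` is 1 because 5/2 is a tie and n = 2 is chosen, while `ieee_rem(7, 2)` is -1 because the tie 7/2 resolves to n = 4. Python's `math.remainder` follows the same definition on binary floating-point operands.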


5.2 Square Root 

The square root operation is defined for all nonnegative floating-point numbers. The square root of -0
shall be -0.

Table 21: Square root

| fp1 | sig NaN | quiet NaN | -0 | +0 | finite < 0 | finite > 0 | -inf | +inf |
|---|---|---|---|---|---|---|---|---|
| SQR fp1 | quiet NaN, invalid | fp1 | -0 | +0 | quiet NaN, invalid | round($\sqrt{v(fp1)}$) | quiet NaN, invalid | +inf |


5.3 Floating-point Precision Conversion 


Conversion between floating-point numbers of all precisions shall be possible. When converting from
a lower to a higher precision the result will be exact. Conversion from a higher to a lower precision may
signal inexact.


Table 22: Floating-point precision conversions (fp1 to fp2)

| fp2 precision \ fp1 | sig NaN | quiet NaN | -0 | +0 | finite ≠ 0 | -inf | +inf |
|---|---|---|---|---|---|---|---|
| narrower precision | quiet NaN, invalid | fp1 | -0 | +0 | round v(fp1) | -inf | +inf |
| wider precision | quiet NaN, invalid | fp1 | fp1 | fp1 | fp1 | fp1 | fp1 |


5.4 Conversion Between Floating-point and Integer 

Standard IEEE-854 specifies that compliant implementations must provide conversion between
floating-point and integer number encodings. However, integer encoding is not in the scope of IEEE-854.
When conversion from a floating-point to an integer number precludes a faithful representation, an invalid
exception shall be raised. An exception may arise due to conversion of NaNs, infinities, or on overflow
when the floating-point value exceeds the maximum value of the integer encoding. When a representation
does not exist for NaNs or infinities in the integer encoding, or if overflow occurs, the conversion result,
represented in Table 23 by res_exc4, is undefined and the invalid exception is signaled.

The exc3 exception flag is inexact if the value after conversion is not equal to the floating-point value
before conversion. Conversion from an integer to a floating-point number should always be exact. If
integer encodings of -0 and +0 are possible then conversion between floating-point and integers shall preserve
the zero sign. When no such encoding exists for integer numbers, conversion of zero should result in +0.


Table 23: Floating-point to integer conversion (fp1 to Integer)

| rounding mode \ fp1 | sig NaN | quiet NaN | -0 | +0 | finite ≠ 0 | -inf | +inf |
|---|---|---|---|---|---|---|---|
| near | quiet NaN, invalid or res_exc4 | fp1 or res_exc4 | -0 or 0 | +0 or 0 | (if $\lfloor v(fp1) \rfloor$ is even then $\lfloor \lceil 2 \times v(fp1) \rceil / 2 \rfloor$, exc3 else $\lceil \lfloor 2 \times v(fp1) \rfloor / 2 \rceil$, exc3) or res_exc4 | -inf or res_exc4 | +inf or res_exc4 |
| pos_inf | quiet NaN, invalid or res_exc4 | fp1 or res_exc4 | -0 or 0 | +0 or 0 | ($\lceil v(fp1) \rceil$, exc3) or res_exc4 | -inf or res_exc4 | +inf or res_exc4 |
| neg_inf | quiet NaN, invalid or res_exc4 | fp1 or res_exc4 | -0 or 0 | +0 or 0 | ($\lfloor v(fp1) \rfloor$, exc3) or res_exc4 | -inf or res_exc4 | +inf or res_exc4 |
| zero | quiet NaN, invalid or res_exc4 | fp1 or res_exc4 | -0 or 0 | +0 or 0 | (if fp1 > 0 then $\lfloor v(fp1) \rfloor$, exc3 else $\lceil v(fp1) \rceil$, exc3) or res_exc4 | -inf or res_exc4 | +inf or res_exc4 |


exc3 = IF integer(fp) = value(fp) THEN inexact = false ELSE inexact = true
res_exc4 = undefined, invalid
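
The finite, nonzero column of the conversion can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `to_integer` is hypothetical, the round-to-near case is written as $\lfloor \lceil 2v \rceil / 2 \rfloor$ when $\lfloor v \rfloor$ is even and $\lceil \lfloor 2v \rfloor / 2 \rceil$ otherwise (one reading of the table's formula), and the range of the integer encoding, and hence res_exc4, is not modeled.

```python
import math
from fractions import Fraction


def to_integer(v, mode):
    """Convert exact value v to an integer; return (integer, exc3 inexact flag)."""
    v = Fraction(v)
    if mode == "near":
        if math.floor(v) % 2 == 0:
            n = math.ceil(2 * v) // 2          # floor(ceil(2v) / 2)
        else:
            n = -((-math.floor(2 * v)) // 2)   # ceil(floor(2v) / 2)
    elif mode == "pos_inf":
        n = math.ceil(v)
    elif mode == "neg_inf":
        n = math.floor(v)
    elif mode == "zero":
        n = math.floor(v) if v > 0 else math.ceil(v)
    else:
        raise ValueError("unknown rounding mode: " + mode)
    return n, n != v
```

The round-to-near branch agrees with Python's built-in `round` on exact rationals, which also sends ties to the even integer; this gives a convenient cross-check of the formula.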


Table 24: Integer to floating-point conversion (Integer to fp)

| Integer | -0 | +0 or 0 | finite ≠ 0 |
|---|---|---|---|
| fp | -0 | +0 | v(fp) = value(integer) |


5.5 Round Floating-point Number to Integral Value 


Conversion from floating-point to integral valued floating-point rounds a floating-point number,
according to the rounding mode, to a floating-point number in the same precision with an integer value.


Table 25: Floating-point to integral valued floating-point (fp1 to fp2)

| rounding mode \ fp1 | sig NaN | quiet NaN | -0 | +0 | finite ≠ 0 | -inf | +inf |
|---|---|---|---|---|---|---|---|
| near | quiet NaN, invalid | fp1 | -0 | +0 | v(fp2) = (if $\lfloor v(fp1) \rfloor$ is even then $\lfloor \lceil 2 \times v(fp1) \rceil / 2 \rfloor$ else $\lceil \lfloor 2 \times v(fp1) \rfloor / 2 \rceil$), exc3 | -inf | +inf |
| pos_inf | quiet NaN, invalid | fp1 | -0 | +0 | v(fp2) = $\lceil v(fp1) \rceil$, exc3 | -inf | +inf |
| neg_inf | quiet NaN, invalid | fp1 | -0 | +0 | v(fp2) = $\lfloor v(fp1) \rfloor$, exc3 | -inf | +inf |
| zero | quiet NaN, invalid | fp1 | -0 | +0 | v(fp2) = (if v(fp1) > 0 then $\lfloor v(fp1) \rfloor$ else $\lceil v(fp1) \rceil$), exc3 | -inf | +inf |


exc3 = IF v(fp1) = v(fp2) THEN inexact = false ELSE inexact = true

5.6 Conversion Between Floating-point and Decimal String

Decimal strings are strings of characters representing decimal numbers. The format of decimal strings
is not covered by IEEE-854. An uninterpreted function format is defined which takes the floating-point
value v(fp) and converts it to a decimal string value represented by $\pm M \times 10^{\pm N}$. The function
value extracts the value of a decimal string.


Table 26: Decimal string to floating-point conversion (DS to fp)

| DS | quiet NaN | signaling NaN or NaN | unrecognizable string | -0 | +0 | -inf or -infinity | +inf or +infinity or inf or infinity | value(DS) |
|---|---|---|---|---|---|---|---|---|
| fp | quiet NaN | signaling NaN | quiet NaN, invalid | -0 | +0 | -inf | +inf | round(value(DS)) |


Table 27: Floating-point to decimal string conversion (fp to DS)

| fp | quiet NaN | signaling NaN | -0 | +0 | -inf | +inf | finite ≠ 0 |
|---|---|---|---|---|---|---|---|
| DS | ‘NaN’ | ‘NaN’, invalid | value(DS) = 0 | value(DS) = 0 | ‘-inf’ or ‘-infinity’ | ‘inf’ or ‘infinity’ | format(v(fp)) |


The function format must have the property round(value(format(v(fp)))) = fp when
rounding to nearest and conversions from floating-point numbers to decimal strings are performed such
that M has D digits of precision where

$D = \lceil p \log_{10}(2) + 1 \rceil$ for $b = 2$
$D = p$ for $b = 10$
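
The digit count D and the round-trip property can be tried out on Python's built-in floats, which are binary with p = 53 on common platforms; that platform detail is an assumption of this sketch, as are the function names. Python's own formatting and parsing stand in for the uninterpreted functions format and value.

```python
import math


def decimal_digits(p, b):
    """Number of decimal digits D needed for M, per the formula above."""
    if b == 2:
        return math.ceil(p * math.log10(2) + 1)
    if b == 10:
        return p
    raise ValueError("b must be 2 or 10")


def round_trips(x, digits):
    """True if formatting x with `digits` significant digits and reading it back gives x."""
    text = f"{x:.{digits - 1}e}"   # M x 10^N with `digits` significant digits
    return float(text) == x
```

`decimal_digits(24, 2)` is 9 and `decimal_digits(53, 2)` is 17, so 17 significant decimal digits are enough for a double to survive conversion to a decimal string and back.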


5.7 Comparison 

For any two arbitrary floating-point numbers, one and only one of the following relations must hold:
“less than”, “equal”, “greater than”, or “unordered”. Comparison operations can be implemented in two
possible ways: 1) a comparison returns as its result one of the four relations above; 2) a predicate
defines a specific relation between two floating-point numbers and the comparison returns true if the
predicate holds on the arguments and false otherwise, possibly together with an exception. Tables 28 and 29
describe the relation and predicate implementation options. The predicates in Table 29 are defined in terms
of the relations of Table 28. Note that for Table 28, if fp1 is a NaN and fp1 = fp2, the result of a comparison
is still unordered. That is, a NaN compares unordered with itself.

Table 28: Floating-point comparison: relations

Table 29: Floating-point comparison: predicates (fp1 predicate fp2)

  predicates                 relation
  ASCII  Fortran  Math.      less than  equal  greater than  unordered
  -----  -------  -----      ---------  -----  ------------  --------------
  =      .EQ.     =          false      true   false         false
  ?<>    .NE.     ≠          true       false  true          true
  >      .GT.     >          false      false  true          false, invalid
  >=     .GE.     ≥          false      true   true          false, invalid
  <      .LT.     <          true       false  false         false, invalid
  <=     .LE.     ≤          true       true   false         false, invalid
  ?      .UN.                false      false  false         true
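The two implementation options can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: operands are modelled by Python floats, and the names `relation`, `PREDICATES`, and `predicate` are hypothetical. The truth values and the invalid flag follow the rows of Table 29.

```python
import math


def relation(x: float, y: float) -> str:
    """Option 1: return exactly one of the four mutually exclusive relations."""
    if math.isnan(x) or math.isnan(y):
        return "unordered"
    if x < y:
        return "less than"
    if x > y:
        return "greater than"
    return "equal"


# Option 2: each predicate is the set of relations for which it is true,
# plus a flag saying whether an unordered result also signals invalid.
PREDICATES = {
    "=": ({"equal"}, False),
    "?<>": ({"less than", "greater than", "unordered"}, False),
    ">": ({"greater than"}, True),
    ">=": ({"greater than", "equal"}, True),
    "<": ({"less than"}, True),
    "<=": ({"less than", "equal"}, True),
    "?": ({"unordered"}, False),
}


def predicate(name: str, x: float, y: float):
    """Return (truth value, invalid signalled) for the named predicate."""
    true_for, invalid_on_unordered = PREDICATES[name]
    rel = relation(x, y)
    return rel in true_for, invalid_on_unordered and rel == "unordered"


if __name__ == "__main__":
    nan = float("nan")
    print(relation(nan, nan))
    print(predicate("<", 1.0, nan))
```

Defining every predicate as a set of relations keeps the predicate option consistent with the relation option by construction.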


Part 2: Definition in HOL 


1 Introduction 

This part of the paper presents the definition of the IEEE-854 standard in the HOL system[2]. The standard
is formalized using the higher-order logic language available in the system. The HOL system’s logic
is Church’s simple theory of types with polymorphic and definitional extensions. The HOL system is a
general purpose mechanized theorem prover. The system supports both forward and backward proofs. The
forward proof style applies inference rules to existing theorems to obtain new theorems and eventually the
desired theorem. Backward, or goal-oriented, proofs start with the goal to be proven. Tactics are applied to
the goal and subgoals until the goal is decomposed into simpler existing theorems or axioms.

By defining the IEEE-854 standard in the HOL system, it is possible to show that the standard meets 
given requirements. Desirable properties of the standard can be formulated in the logic and proofs can be 
constructed in the system to show that the formalization of the standard complies with stated properties. 

The system’s basic language includes the natural numbers and boolean type. John Harrison’s reals
library[4] and Elsa Gunter’s integer library[3] are used, respectively, for the definition of the real and
integer types. The real and integer numbers are used as part of the IEEE-854 formalization. In the HOL system
the symbol ? represents ∃, ! represents ∀, and @ is the choice or Hilbert operator. Entries in the HOL
system are represented by the courier (typewriter) font.

2 Floating-point Numbers and Precisions

The four parameters defining a precision, b, p, Emax, and Emin, are defined in the HOL system by
declaring b as a constant and placing constraints on the values of p, Emax, and Emin. b and p range over
the natural numbers (type “:num”) and Emax and Emin range over the integers (type “:integer”). The value
of constant b is either 2 or 10.


new_definition ('b',
   "b = @n. (n=2)\/(n=10)");;


The formula “@n. (n=2)\/(n=10)” can be read “choose an n such that n=2 or n=10.” The number
of digits p is restricted in all precisions by the constraint \(b^{p-1} \geq 10^5\),
which is algebraically equivalent to

\[ b = 2:\ p > 17, \qquad b = 10:\ p > 5 \]

and is given by the definition,

new_definition ('Sig',
   "Sig p = ((b = 2) ==> (17 < p))/\
            ((b = 10) ==> (5 < p))");;
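The claimed equivalence between the exponential constraint and the two linear bounds on p can be checked by brute force. The sketch below is an illustrative Python check, not part of the HOL theory; the function names are hypothetical.

```python
def digits_constraint(b: int, p: int) -> bool:
    """The constraint b**(p-1) >= 10**5 stated in the text."""
    return b ** (p - 1) >= 10 ** 5


def sig(b: int, p: int) -> bool:
    """Python transcription of the HOL predicate Sig for a given radix b."""
    return (b != 2 or 17 < p) and (b != 10 or 5 < p)


if __name__ == "__main__":
    # Both formulations agree for every precision in a generous range.
    for b in (2, 10):
        for p in range(1, 200):
            assert digits_constraint(b, p) == sig(b, p)
    print("constraints agree for b in {2, 10} and 1 <= p < 200")
```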


The constraint (Emax - Emin)/p > 5 is imposed on the values of Emax, Emin and p by the definition,


new_definition ('single',
   "single pr emax emin = (INT(5*pr) below (emax minus emin))");;

which must be true for single precision as well as all other precisions (see footnote 1).

Additional constraints are imposed on the parameters for double, single extended, and double
extended precisions. For double precision:

\[ b^{p_d} \geq 10\, b^{2 p_s}, \qquad E_{max_d} \geq 8 E_{max_s} + 7, \qquad E_{min_d} \leq 8 E_{min_s} \]

where the subscripts d and s denote double and single precision respectively, and is given by the definition,
new_definition ('double',
   "double p_s p_d emax_s emin_s emax_d emin_d =
      (single p_d emax_d emin_d) /\
      ((b = 2) ==> ((4 + (2*p_s)) <= p_d)) /\
      ((b = 10) ==> ((1 + (2*p_s)) <= p_d)) /\
      (((INT 8 times emax_s) plus INT 7) below_or_e emax_d) /\
      (emin_d below_or_e (INT 8 times emin_s))");;

1. The natural, integer, and real numbers are different types in the HOL system and different operators and relations are
defined on these types. The relations below, below_or_e, minus, plus, and times on the integers have the obvious meaning of less
than, less than or equal, subtraction, addition, and multiplication, respectively. The operator INT takes a natural number and maps
it into an integer number.

For extended precision, the following constraints must hold over the base precision:

\[ E_{max_e} \geq 8 E_{max_b} + 7, \qquad E_{min_e} \leq 8 E_{min_b}, \qquad p_e \geq 1.2\, p_b, \]
\[ \text{for } b = 2:\quad p_e \geq p_b + \lceil \log_2 (E_{max_b} - E_{min_b}) \rceil, \]

where the subscripts e and b denote the extended and base precision. These constraints are defined by,
new_definition ('extended',
   "extended p_b p_e emax_b emin_b emax_e emin_e =
      (((INT 8 times emax_b) plus INT 7) below_or_e emax_e) /\
      (emin_e below_or_e (INT 8 times emin_b)) /\
      (&p_e real_ge ((&1 real_add (&2/(&10))) real_mul (&p_b))) /\
      ((b = 2) ==>
         ((p_b + ceiling (log 2 (&(FST (REP_integer (emax_b minus emin_b))))))
            <= p_e))");;

(See footnote 2 for the ceiling, log, and & operators.)

A floating-point number of any given precision must have an exponent value within the precision’s
maximum and minimum exponent. The digits must be b-radix based.

new_definition ('precis',
   "precis emax emin fp =
      (emin below_or_e (exponent fp)) /\
      ((exponent fp) below_or_e emax) /\
      (!n. (digits fp) n < b)");;

An implementation of the IEEE-854 standard will assign specific values to b, p, Emax, and Emin. These values
must be shown to comply with the restrictions above. For example, for an implementation with single and
double precision with values,

b = 2, p_s = 24, Emax_s = 127, Emin_s = -126, p_d = 53, Emax_d = 1023, and Emin_d = -1022

we must show that,

"b=2 ==> (b=2)\/(b=10)";;

"(b=2)/\(p_s=24) ==> (((b = 2) ==> (17 < p_s))/\
                      ((b = 10) ==> (5 < p_s)))";;

"single 24 (INT 127) (neg (INT 127))";;

"double 24 53 (INT 127) (neg (INT 127)) (INT 1023)
        (neg (INT 1022))";;

2. The ceiling function in this definition takes a real number as its argument and returns a natural number. The ceiling
function of “x : real” is the least positive integer “n : num” greater than or equal to x. If x is negative, ceiling of x is zero. log 2 x is the
logarithm base 2 of x. Arithmetic operators on the real numbers are prefixed by real_, as for example in real_add for the binary
infix addition operation. The operator & takes a natural number and maps it to the reals. The numbers &1, &2, ... are reals
with values 1, 2, ...
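Before attempting the proofs in HOL, the parameter values can be sanity-checked numerically. The following Python sketch transcribes the predicates Sig, single, and double over ordinary integers. It takes the radix as an explicit argument, which is an assumption of the sketch (in the HOL text b is a constant), and it uses the parameter values quoted in the text above.

```python
def sig(b: int, p: int) -> bool:
    """Digit-count restriction: p > 17 for b = 2, p > 5 for b = 10."""
    return (b != 2 or 17 < p) and (b != 10 or 5 < p)


def single(p: int, emax: int, emin: int) -> bool:
    """(Emax - Emin)/p > 5, written without division as 5*p < Emax - Emin."""
    return 5 * p < emax - emin


def double(b, p_s, p_d, emax_s, emin_s, emax_d, emin_d) -> bool:
    """Constraints relating a double precision to its single precision."""
    return (
        single(p_d, emax_d, emin_d)
        and (b != 2 or 4 + 2 * p_s <= p_d)
        and (b != 10 or 1 + 2 * p_s <= p_d)
        and 8 * emax_s + 7 <= emax_d
        and emin_d <= 8 * emin_s
    )


if __name__ == "__main__":
    b, p_s, emax_s, emin_s = 2, 24, 127, -126
    p_d, emax_d, emin_d = 53, 1023, -1022
    print(sig(b, p_s), single(p_s, emax_s, emin_s),
          double(b, p_s, p_d, emax_s, emin_s, emax_d, emin_d))
```

A numeric check of this kind does not replace the deductive proof, but it catches parameter sets that cannot satisfy the constraints.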


2.1 Floating-point Number Representation 


Floating-point numbers are represented in HOL by their meaning: a finite value, an infinity, or a NaN. A
new type is created to define floating-point numbers:

define_type 'fp_num'
   'fp_num = finite (num#integer#(num -> num)) |
             infinite num |
             NaN (NaN_type#num)';;

“finite”, “infinite”, and “NaN” become type constructors that, when applied to a triple of type
“:(num#integer#(num -> num))”, an element of type “:num”, and a pair of type
“:(NaN_type#num)”, respectively, return an element of type “:fp_num”.

A new type is used in the definition of “fp_num” above which defines signaling and quiet NaNs:

define_type 'NaN_type' 'NaN_type = signal | quiet';;

The following definitions for identifying and manipulating floating-point (fp) numbers are used in the
specification of floating-point operations:

new_definition ('is_finite',
   "is_finite fp = (?X. fp = (finite X))");;

new_definition ('is_infinite',
   "is_infinite fp = (?X. fp = (infinite X))");;

new_definition ('is_NaN',
   "is_NaN fp = (?X. fp = (NaN X))");;

new_definition ('i_finite',
   "i_finite fp = (@X. fp = (finite X))");;

new_definition ('i_infinite',
   "i_infinite fp = (@X. fp = (infinite X))");;

new_definition ('i_NaN',
   "i_NaN fp = (@X. fp = (NaN X))");;

The first three definitions are predicates which return true when applied to a finite, infinite, and NaN fp
number, respectively, and false otherwise. The last three definitions are the inverses of the respective type
constructors and will return the argument of the constructor when applied to the appropriate fp number.
The theorems

|- !z. i_finite (finite z) = z
|- !z. i_infinite (infinite z) = z
|- !z. i_NaN (NaN z) = z

illustrate the action of the inverse functions.

The following additional definitions extract the elements of the arguments of the floating-point constructors
and identify properties of the argument:

new_definition ('fp_sign_d',
   "fp_sign_d fp = (@n. (n = (FST (i_finite fp))) \/
                        (n = (i_infinite fp)))");;

define_type 'fp_sign'
   'fp_sign = positive | negative';;

new_definition ('fp_sign',
   "fp_sign fp = (EVEN (fp_sign_d fp)) => positive | negative");;

new_definition ('exponent',
   "exponent (s:num,Exp:integer,dig:num -> num) = Exp");;

new_definition ('digits',
   "digits (s:num,Exp:integer,dig:num -> num) = dig");;

new_definition ('fp_is_pos',
   "fp_is_pos fp = fp_sign fp = positive");;

new_definition ('fp_is_neg',
   "fp_is_neg fp = fp_sign fp = negative");;

new_definition ('fp_is_zero',
   "fp_is_zero fp = (!n. (fp_digits fp) n = 0)");;

The greatest and least magnitudes for a finite floating-point number are given by:

new_definition ('Gpos',
   "Gpos (emax:integer) = (0, emax, (\d:num. b-1))");;

new_definition ('Gneg',
   "Gneg (emax:integer) = (1, emax, (\d:num. b-1))");;

new_definition ('Lpos',
   "Lpos p (emin:integer) = (0, emin, \d:num. (d = (p-1)) => 1 | 0)");;

new_definition ('Lneg',
   "Lneg p (emin:integer) = (1, emin, \d:num. (d = (p-1)) => 1 | 0)");;

3 Exceptions and Traps

Operations on floating-point numbers can signal exceptions as a result of performing the operation. 
Exceptions are declared as a new type: 

define_type 'except' 'except_type = invalid | div_by_zero |
   overflow_w_inex | underflow | underflow_w_inex | inexact |
   no_excep';;

A signaling exception will set a status flag and, if enabled by the user, will invoke an exception
handling trap. If exception handlers are implemented then each exception should have a user-controlled trap
associated with it.

The resulting value on some operations will depend on whether an exception is detected and/or a trap
is enabled. Conditions which will result in exceptions will be defined within the operation’s definition. The
status of the exception traps is defined by the 5-tuple (invalid, div_by_zero, overflow, underflow, inexact).
The following functions extract the status of each of the exception traps:

new_definition ('invalid_t',
   "invalid_t (t1:bool,t2:bool,t3:bool,t4:bool,t5:bool) = t1");;

new_definition ('div_by_zero_t',
   "div_by_zero_t (t1:bool,t2:bool,t3:bool,t4:bool,t5:bool) = t2");;

new_definition ('overflow_t',
   "overflow_t (t1:bool,t2:bool,t3:bool,t4:bool,t5:bool) = t3");;

new_definition ('underflow_t',
   "underflow_t (t1:bool,t2:bool,t3:bool,t4:bool,t5:bool) = t4");;

new_definition ('inexact_t',
   "inexact_t (t1:bool,t2:bool,t3:bool,t4:bool,t5:bool) = t5");;


4 Rounding 

Rounding will take an infinitely precise number r, characterized in the HOL system by the real num- 
bers, and convert it into a floating-point representation. Four rounding modes are specified in the standard. 
The rounding mode is declared as a new type: 

define_type 'round_m' 'round_m = to_near |
   to_pos_inf | to_neg_inf | to_zero';;

The rounding operation is defined by a family of functions to cover all rounding modes and argument
values. The first set of functions is defined for values of r which will generate a finite floating-point
representation. A function is defined for each of the four rounding modes. These four functions take a real
number, a rounding precision, and a destination precision predicate and return a finite floating-point number.
The real number argument must have a value that can be represented by a finite floating-point number for the
given destination precision. Round to near is defined by,

new_definition ('round2near',
   "round2near r p precis =
      (?fp1. precis fp1 /\
         (!fp. (precis fp) /\ ~(fp_value fp p = fp_value fp1 p) ==>
            abs(fp_value fp1 p real_sub r) real_lt abs(fp_value fp p real_sub r))) =>
      (@fp1. precis fp1 /\
         (!fp. (precis fp) /\ ~(fp_value fp p = fp_value fp1 p) ==>
            abs(fp_value fp1 p real_sub r) real_lt abs(fp_value fp p real_sub r))) |
      @fp1. (precis fp1) /\
         (!fp. (precis fp) ==>
            abs(fp_value fp1 p real_sub r) real_le abs(fp_value fp p real_sub r)) /\
         (EVEN ((digits fp1) (p-1)))");;

Round to near will return a floating-point number with a unique value nearest to the real number, if one
exists. If two floating-point numbers have values equally near, round to near will return the one with an even
least significant digit. Round to near uses the function “fp_value”, which extracts the value of a
floating-point number, returning a real number. The function “fp_value” is defined by,

new_definition ('fp_value',
   "fp_value (s, Exp, dig) p =
      ((real_neg (&1)) pow s) real_mul
      ((NEG Exp => (real_inv (&(b EXP (SND (REP_integer Exp))))) |
                   (&(b EXP (FST (REP_integer Exp)))))) real_mul
      (frac_sum (\dn. &(dig dn) real_mul (real_inv (&(b EXP dn)))) p)");;

The value function “fp_value” depends in turn on the summation function

\[ \text{frac\_sum } Fn\ m = \sum_{n=0}^{m-1} Fn(n), \]

new_prim_rec_definition ('frac_sum',
   "(frac_sum Fn 0 = &0) /\
    (frac_sum Fn (SUC n) =
       (Fn n) real_add (frac_sum Fn n))");;
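The value function can be mirrored with exact rational arithmetic. The following Python sketch is an illustration only: it passes the radix b as a parameter (in the HOL text b is a constant), represents the digit function as a Python callable, and uses `fractions.Fraction` in place of the HOL reals.

```python
from fractions import Fraction


def frac_sum(fn, m: int) -> Fraction:
    """Sum of fn(n) for n = 0 .. m-1, following the primitive recursion."""
    total = Fraction(0)
    for n in range(m):
        total += fn(n)
    return total


def fp_value(s: int, exp: int, dig, p: int, b: int = 2) -> Fraction:
    """(-1)**s * b**exp * sum over n < p of dig(n) * b**(-n)."""
    sign = (-1) ** s
    scale = Fraction(b) ** exp  # exact also for negative exponents
    significand = frac_sum(lambda n: Fraction(dig(n), b ** n), p)
    return sign * scale * significand


if __name__ == "__main__":
    # The digit pattern of Gpos (every digit b-1) with p = 24, Emax = 127, b = 2.
    largest = fp_value(0, 127, lambda n: 1, 24)
    print(float(largest))
    # The digit pattern of Lpos (a single 1 in position p-1) with Emin = -126.
    smallest = fp_value(0, -126, lambda n: 1 if n == 23 else 0, 24)
    print(smallest == Fraction(1, 2 ** 149))
```

With these parameters the Gpos pattern evaluates to (2 - 2^-23) * 2^127 and the Lpos pattern to 2^-149.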

Round to positive infinity returns the smallest floating-point number greater than or equal to r:

new_definition ('round2pinf',
   "round2pinf r p precis =
      @fp1. (r real_le (fp_value fp1 p)) /\
            (precis fp1) /\
            (!fp. r real_le (fp_value fp p) ==>
                  (fp_value fp1 p) real_le (fp_value fp p))");;

Round to negative infinity returns the largest floating-point number less than or equal to r:

new_definition ('round2ninf',
   "round2ninf r p precis =
      @fp1. ((fp_value fp1 p) real_le r) /\
            (precis fp1) /\
            (!fp. (fp_value fp p) real_le r ==>
                  (fp_value fp p) real_le (fp_value fp1 p))");;

Round to zero returns the largest magnitude floating-point number with magnitude less than or equal to the
magnitude of r:

new_definition ('round2zero',
   "round2zero r p precis =
      @fp1. (precis fp1) /\
            (!fp. abs(fp_value fp p) real_le (abs r) ==>
                  abs(fp_value fp p) real_le abs(fp_value fp1 p))");;
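The four directed roundings can be explored on a toy format by enumerating every representable value. The Python sketch below is an illustration, not a transcription of the HOL terms: the format parameters (b = 2, p = 3, Emin = -1, Emax = 1) are arbitrary choices, values rather than digit triples are returned, and ties in round to near are broken by choosing the neighbour that is an even multiple of the local spacing, which coincides with an even last digit in the normalized representation.

```python
from fractions import Fraction
from itertools import product


def representable(b=2, p=3, emin=-1, emax=1):
    """All values sign * b**e * d0.d1...d(p-1) of a toy format, as exact fractions."""
    values = set()
    for e in range(emin, emax + 1):
        for digits in product(range(b), repeat=p):
            significand = sum(Fraction(d, b ** n) for n, d in enumerate(digits))
            magnitude = Fraction(b) ** e * significand
            values.update((magnitude, -magnitude))
    return values


VALUES = representable()


def round2pinf(r):
    """Smallest representable value >= r (r must not exceed the largest value)."""
    return min(v for v in VALUES if v >= r)


def round2ninf(r):
    """Largest representable value <= r (r must not be below the smallest value)."""
    return max(v for v in VALUES if v <= r)


def round2zero(r):
    """Largest-magnitude representable value with the sign of r and magnitude <= |r|."""
    return max((v for v in VALUES if abs(v) <= abs(r) and v * r >= 0), key=abs)


def round2near(r):
    """Nearest representable value; ties go to the even multiple of the spacing."""
    lo, hi = round2ninf(r), round2pinf(r)
    if lo == hi:
        return lo
    if r - lo != hi - r:
        return lo if r - lo < hi - r else hi
    spacing = hi - lo
    return lo if (lo / spacing) % 2 == 0 else hi


if __name__ == "__main__":
    r = Fraction(11, 8)
    print(round2near(r), round2pinf(r), round2ninf(r), round2zero(r))
```

As in the HOL definitions, the argument must lie within the representable range; outside it the sketch raises `ValueError` because no candidate exists.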

The next set of rounding functions is defined for real number arguments with unbounded values. The real
number value may be outside the representable range of finite floating-point numbers. When rounding is
performed on unbounded real arguments, the rounding function must check for overflow and return an
overflow exception flag when overflow is detected. The rounding functions will also check for underflow
and inexact, and return the appropriate flag when an exception is detected.

The functions take as arguments a real number, rounding precision, traps status, rounding mode, tininess
detection flag, accuracy detection flag, and destination maximum and minimum exponent. They
return a floating-point number and an exception flag.

Underflow and inexact detection are handled by separate functions outside the rounding operation.
Overflow is detected inside the rounding function. Also, when the real number to be rounded has magnitude
less than \(b^{E_{min}}\), the rounding is handled by a separate function “denormal”.

The functions “tininess”, “accuracy”, and “underfl” are used for underflow detection. The
function “inex” is used both for underflow and inexact detection:


new_definition ('tininess',
   "tininess r p mode tiny emax emin =
      let round = ((mode = to_near) => round2near |
                   (mode = to_pos_inf) => round2pinf |
                   (mode = to_neg_inf) => round2ninf |
                   round2zero) in
      ~tiny => (~(&0 = r) /\
                abs(r) real_lt (real_inv (&(b EXP (SND (REP_integer emin)))))) |
               (~(&0 = r) /\
                fp_value (round r p (precis_c emax emin)) p real_lt
                   (real_inv (&(b EXP (SND (REP_integer emin))))))");;


new_definition ('accuracy',
   "accuracy r p mode acc emax emin =
      let round = ((mode = to_near) => round2near |
                   (mode = to_pos_inf) => round2pinf |
                   (mode = to_neg_inf) => round2ninf |
                   round2zero) in
      ~acc => ~(fp_value (round r p exp_unbound) p = r) |
              ~(fp_value (round r p (precis_c emax emin)) p = r)");;


new_definition ('underfl',
   "underfl r p traps mode tiny acc emax emin =
      let u = (~(underflow_t traps) =>
                 (tininess r p mode tiny emax emin /\ accuracy r p mode acc emax emin) |
                 tininess r p mode tiny emax emin) in
      u => underflow | no_excep");;

new_definition ('inex',
   "inex r p mode emax emin =
      let round = ((mode = to_near) => round2near |
                   (mode = to_pos_inf) => round2pinf |
                   (mode = to_neg_inf) => round2ninf |
                   round2zero) in
      (fp_value (round r p (precis_c emax emin)) p = r) => no_excep |
                                                           inexact");;

When both underflow and inexact are detected the exception flag becomes underflow_w_inex:

new_definition ('underflow_inexact',
   "underflow_inexact r p traps mode tiny acc emax emin =
      ((underfl r p traps mode tiny acc emax emin = underflow) /\
       (inex r p mode emax emin = inexact)) => underflow_w_inex |
      (underfl r p traps mode tiny acc emax emin = underflow) => underflow |
      (inex r p mode emax emin = inexact) => inexact |
      no_excep");;
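The flag logic of “underfl” and “underflow_inexact” reduces to a small decision table once the tininess, loss-of-accuracy, and inexact conditions are known. The Python sketch below takes those conditions as ready-made booleans, which is an assumption of the sketch (the HOL functions compute them from r and the precision), and returns the flag names as strings.

```python
def underfl(trap_enabled: bool, tiny: bool, loss_of_accuracy: bool) -> str:
    """Underflow is signalled on tininess alone when the trap is enabled,
    and on tininess together with loss of accuracy when it is not."""
    detected = tiny if trap_enabled else (tiny and loss_of_accuracy)
    return "underflow" if detected else "no_excep"


def underflow_inexact(underflow_flag: str, inexact_flag: str) -> str:
    """Combine the separately detected underflow and inexact flags."""
    under = underflow_flag == "underflow"
    inex = inexact_flag == "inexact"
    if under and inex:
        return "underflow_w_inex"
    if under:
        return "underflow"
    if inex:
        return "inexact"
    return "no_excep"


if __name__ == "__main__":
    for trap in (False, True):
        for tiny in (False, True):
            for loss in (False, True):
                flag = underfl(trap, tiny, loss)
                combined = underflow_inexact(flag, "inexact" if loss else "no_excep")
                print(trap, tiny, loss, flag, combined)
```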

If overflow is detected during rounding and the overflow trap handler is enabled, the result of the
operation will be the infinitely precise result of the operation divided by \(b^{\alpha}\) and then rounded. The exponent
adjustment \(\alpha\) is chosen to be approximately \(3(E_{max} - E_{min})/4\) and should be divisible by twelve (see footnote 3):

new_definition ('alpha',
   "alpha emax emin =
      let app = (3*(FST (REP_integer (emax minus emin)))) in
      let q = @n. (48*n) < app /\ app < (48*(n+1)) in
      (app - (48*q)) < ((48*(q+1)) - app) => 12*q | 12*(q+1)");;
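The exponent adjustment can be computed for concrete exponent ranges. The Python sketch below follows the definition of “alpha” step by step. One point is an observation about the definition rather than something stated in the text: when 3(Emax - Emin) is an exact multiple of 48, no n satisfies both strict inequalities, so the choice operator leaves q unspecified; the sketch raises an error in that case instead of guessing.

```python
def alpha(emax: int, emin: int) -> int:
    """Multiple of 12 nearest to 3*(emax - emin)/4, following the definition above."""
    app = 3 * (emax - emin)
    q, remainder = divmod(app, 48)
    if remainder == 0:
        # 48*n < app < 48*(n+1) has no solution, so q is unspecified in HOL.
        raise ValueError("q is not determined when app is a multiple of 48")
    # Compare the distances to the two neighbouring multiples of 48.
    return 12 * q if (app - 48 * q) < (48 * (q + 1) - app) else 12 * (q + 1)


if __name__ == "__main__":
    print(alpha(127, -126))
    print(alpha(1023, -1022))
```

For Emax = 127, Emin = -126 the sketch gives 192, and for Emax = 1023, Emin = -1022 it gives 1536.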

new_definition ('r_to_near',
   "r_to_near r p traps mode tiny acc emax emin =
      let thr = (&(b EXP (FST (REP_integer emax)))) real_mul
                (&b real_sub (real_inv (&(b EXP (p-1)))/&2)) in
      let bemin = (real_inv (&(b EXP (SND (REP_integer emin))))) in
      (r = &0) => (finite (0,INT 0,\n.0)), no_excep |
      abs(r) real_lt bemin => denormal r p traps mode tiny acc emax emin |
      abs(r) real_lt thr => (finite (round2near r p (precis_c emax emin))),
                            inex r p mode emax emin |
      (overflow_t traps) => finite (round2near (r/(&(b EXP (alpha emax emin))))
                                       p (precis_c emax emin)), overflow_w_inex |
      infinite (rsign r), overflow_w_inex");;


3. The definition of “alpha” is overly restrictive since it gives the exponent adjustment the nearest value to
3(Emax - Emin)/4 which is divisible by 12. If an implementation description uses a different value for the exponent
adjustment and a proof of compliance is to be performed, a new value for “alpha” should be defined consistent with the intended
implementation.

The case for \(|r| < b^{E_{min}}\) is handled by the “denormal” function. If underflow is detected and the
underflow trap is enabled, denormal will return the infinitely precise result multiplied by \(b^{\alpha}\) and then
rounded with the selected rounding mode:

new_definition ('denormal',
   "denormal r p traps mode tiny acc emax emin =
      let round = ((mode = to_near) => round2near |
                   (mode = to_pos_inf) => round2pinf |
                   (mode = to_neg_inf) => round2ninf |
                   round2zero) in
      (~(underflow_t traps) \/
       ((underflow_t traps) /\
        (underfl r p traps mode tiny acc emax emin = no_excep))) =>
         ((is_zero (round r p (precis_c emax emin))) =>
             (finite (rsign r,INT 0,\n.0)), underflow_w_inex |
             (finite (round r p (precis_c emax emin))),
                underflow_inexact r p traps mode tiny acc emax emin
         ) |
      (finite (round (r real_mul (&(b EXP (alpha emax emin)))) p
                     (precis_c emax emin)),
       underflow_inexact r p traps mode tiny acc emax emin)");;

Round to positive infinity:

new_definition ('r_to_pinf',
   "r_to_pinf r p traps mode tiny acc emax emin =
      let thr = real_neg (&(b EXP (FST (REP_integer emax) + 1))) in
      let bemin = (real_inv (&(b EXP (SND (REP_integer emin))))) in
      (r = &0) => (finite (0,INT 0,\n.0)), no_excep |
      abs(r) real_lt bemin => denormal r p traps mode tiny acc emax emin |
      ((thr real_lt r) /\ (r real_le (fp_value (Gpos emax) p)))
         => (finite (round2pinf r p (precis_c emax emin))),
            inex r p mode emax emin |
      (overflow_t traps) => finite (round2pinf (r/(&(b EXP (alpha emax emin))))
                                       p (precis_c emax emin)), overflow_w_inex |
      ((r real_le thr) => finite (Gneg emax), overflow_w_inex |
                          infinite 0, overflow_w_inex)");;

Round to negative infinity: 

new_definition ('r_to_ninf',
   "r_to_ninf r p traps mode tiny acc emax emin =
      let thr = (&(b EXP (FST (REP_integer emax) + 1))) in
      let bemin = (real_inv (&(b EXP (SND (REP_integer emin))))) in
      (r = &0) => (finite (0,INT 0,\n.0)), no_excep |
      abs(r) real_lt bemin => denormal r p traps mode tiny acc emax emin |
      (((fp_value (Gneg emax) p) real_le r) /\ (r real_lt thr))
         => (finite (round2ninf r p (precis_c emax emin))),
            inex r p mode emax emin |
      (overflow_t traps) => finite (round2ninf (r/(&(b EXP (alpha emax emin))))
                                       p (precis_c emax emin)), overflow_w_inex |
      ((thr real_le r) => finite (Gpos emax), overflow_w_inex |
                          infinite 1, overflow_w_inex)");;


Round to zero: 

new_definition ('r_to_zero',
   "r_to_zero r p traps mode tiny acc emax emin =
      let thr = (&(b EXP (FST (REP_integer emax) + 1))) in
      let bemin = (real_inv (&(b EXP (SND (REP_integer emin))))) in
      (r = &0) => (finite (0,INT 0,\n.0)), no_excep |
      abs(r) real_lt bemin => denormal r p traps mode tiny acc emax emin |
      abs(r) real_lt thr => (finite (round2zero r p (precis_c emax emin))),
                            inex r p mode emax emin |
      (overflow_t traps) => finite (round2zero (r/(&(b EXP (alpha emax emin))))
                                       p (precis_c emax emin)), overflow_w_inex |
      r real_lt &0 => finite (Gneg emax), overflow_w_inex |
                      finite (Gpos emax), overflow_w_inex");;

The function “round” is the main function defining rounding. It uses the previous functions to define the
rounding operation for all rounding modes and value ranges:

new_definition ('round',
   "round r p traps mode tiny acc emax emin =
      (mode = to_near) => r_to_near r p traps mode tiny acc emax emin |
      (mode = to_pos_inf) => r_to_pinf r p traps mode tiny acc emax emin |
      (mode = to_neg_inf) => r_to_ninf r p traps mode tiny acc emax emin |
      r_to_zero r p traps mode tiny acc emax emin");;


5 Operations 

In accordance with the IEEE-854 standard, 

... each operation shall be performed as if it first produced an intermediate result correct to infinite
precision and with unbounded range, and then coerced this intermediate result to fit in the destination’s
precision. [6, section 5, page 10]

Infinite precision for a floating-point operation is represented in the HOL system by the real numbers.
When an operation is to be performed where the argument or arguments are finite floating-point numbers,
the arguments are converted to real numbers, the operation is performed in real number arithmetic, and the
result is rounded according to the selected rounding mode. When operations are performed on arguments
of different precisions, the lower precision argument is converted to the higher precision. The function
“p_conv” defines this conversion:

new_definition ('p_conv',
   "p_conv fp ps pl = is_infinite fp => fp |
                      is_NaN fp => fp |
                      finite (fp_sign_d fp, fp_exponent fp,
                              \n. n < ps => fp_digits fp n |
                                  n < pl => 0 |
                                  @n. n < b)");;

“p_conv” converts a fp from a lower precision with “ps” significant digits to a higher precision with
“pl” significant digits.


5.1 Arithmetic 


Five operations are defined: addition, subtraction, multiplication, division, and remainder. “fp_arith” is
then defined as an executive function which checks for NaN operands, normalizes the operands, and calls
an arithmetic operation based on the argument “op”. The operations are declared as a new type:

define_type 'arith_op' 'arith_op = fpadd | fpsub | fpmul | fpdiv |
   fprem';;

Some arguments to an operation might not be valid arguments depending on the operation. Division by
zero is an example. If arguments to an operation are invalid the operation will return a NaN with an
exception flag. Floating-point addition is defined by the function “fp_add”. “fp_add” takes as arguments
floating-point operands “fp1” and “fp2”, operands’ precision “p”, rounding precision “pr”, quiet NaN
argument “cn”, trap status “traps”, rounding mode “mode”, tininess detection flag “tiny”, accuracy
detection flag “acc”, and maximum and minimum exponents “emax” and “emin”:


new_definition ('fp_add',
   "fp_add fp1 fp2 p pr cn traps mode tiny acc emax emin =
      (is_infinite fp1 /\ is_infinite fp2 /\ ~(fp_sign fp1 = fp_sign fp2))
         => (NaN (quiet,cn), invalid) |
      (is_infinite fp1) => (fp1,no_excep) |
      (is_infinite fp2) => (fp2,no_excep) |
      ((fp_is_zero fp1) /\ (fp_is_zero fp2) /\ (fp_sign fp1 = fp_sign fp2))
         => (fp1,no_excep) |
      round ((fp_value (i_finite fp1) p) real_add
             (fp_value (i_finite fp2) p)) pr traps mode tiny acc emax emin");;
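The order in which “fp_add” tests its special cases can be illustrated in Python. The sketch below is a loose model under stated assumptions: operands are Python floats rather than values of type fp_num, NaN operands are assumed to have been filtered beforehand, and the general case is not computed (the function returns None there, where the HOL definition rounds the exact real sum). The name `add_special_cases` is hypothetical.

```python
import math


def add_special_cases(x: float, y: float, default_nan: float = math.nan):
    """Return (result, flag) for the special cases of addition, else None."""
    # Infinities of opposite sign: invalid operation, quiet NaN result.
    if math.isinf(x) and math.isinf(y) and (x > 0) != (y > 0):
        return default_nan, "invalid"
    # A single infinity (or two of the same sign) is returned unchanged.
    if math.isinf(x):
        return x, "no_excep"
    if math.isinf(y):
        return y, "no_excep"
    # Two zeros of equal sign keep that sign.
    if x == 0 and y == 0 and math.copysign(1, x) == math.copysign(1, y):
        return x, "no_excep"
    # General case: handled by rounding the exact sum (not modelled here).
    return None


if __name__ == "__main__":
    print(add_special_cases(math.inf, -math.inf))
    print(add_special_cases(-0.0, -0.0))
    print(add_special_cases(1.0, 2.0))
```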

Floating-point negation changes the arithmetic sign of a floating-point number which is not a NaN: 

new_definition ('fp_neg',
   "fp_neg fp =
      is_NaN fp => fp |
      is_finite fp => (fp_is_pos fp => (finite (1,(SND (i_finite fp)))) |
                                       finite (0,(SND (i_finite fp)))) |
      (fp_is_pos fp => infinite 1 | infinite 0)");;


Subtraction is defined in terms of negation and addition: 


new_definition ('fp_sub',
   "fp_sub fp1 fp2 p pr cn traps mode tiny acc emax emin =
      fp_add fp1 (fp_neg fp2) p pr cn traps mode tiny acc emax emin");;


Floating-point multiplication: 

new_definition ('fp_mul',
   "fp_mul fp1 fp2 p pr cn traps mode tiny acc emax emin =
      ((is_infinite fp1 /\ fp_is_zero fp2) \/ (fp_is_zero fp1 /\ is_infinite fp2))
         => (NaN (quiet,cn), invalid) |
      ((is_infinite fp1 \/ is_infinite fp2) /\
       (fp_sign fp1 = fp_sign fp2)) => (infinite 0,no_excep) |
      ((is_infinite fp1 \/ is_infinite fp2) /\
       ~(fp_sign fp1 = fp_sign fp2)) => (infinite 1,no_excep) |
      (((fp_is_zero fp1) \/ (fp_is_zero fp2)) /\
       (fp_sign fp1 = fp_sign fp2)) => (finite (0,INT 0,\n.0)),no_excep |
      (((fp_is_zero fp1) \/ (fp_is_zero fp2)) /\
       ~(fp_sign fp1 = fp_sign fp2)) => (finite (1,INT 0,\n.0)),no_excep |
      round ((fp_value (i_finite fp1) p) real_mul
             (fp_value (i_finite fp2) p)) pr traps mode tiny acc emax emin");;

Floating-point division: 

new_definition ('fp_div',
   "fp_div fp1 fp2 p pr cn traps mode tiny acc emax emin =
      ((fp_is_zero fp1 /\ fp_is_zero fp2) \/ (is_infinite fp1 /\ is_infinite fp2))
         => (NaN (quiet,cn), invalid) |
      ((fp_is_zero fp2) /\
       (fp_sign fp1 = fp_sign fp2)) => (infinite 0,div_by_zero) |
      ((fp_is_zero fp2) /\
       ~(fp_sign fp1 = fp_sign fp2)) => (infinite 1,div_by_zero) |
      round ((fp_value (i_finite fp1) p) / (fp_value (i_finite fp2) p)) pr
            traps mode tiny acc emax emin");;

Floating-point remainder is defined by IEEE-854 as follows: 

When y ≠ 0, the remainder r = x REM y is defined regardless of the rounding mode by the mathematical
relation r = x - y × n, where n is the integer nearest the exact value x/y; whenever |n - x/y| = 1/2, then n
is even. If r = 0, its sign shall be that of x. [6, section 5.1, page 10]

The remainder function is defined in HOL by: 

new_definition ('fp_rem',
   "fp_rem fpx fpy p (pr:num) cn traps (mode:round_m) tiny acc emax emin =
      (is_infinite fpx \/ fp_is_zero fpy) => (NaN (quiet,cn), invalid) |
      (is_infinite fpy) => (fpx,no_excep) |
      (let r = (fp_value (i_finite fpx) p real_sub
                ((fp_value (i_finite fpy) p real_mul
                  (real_to_int_real
                    ((fp_value (i_finite fpx) p) / (fp_value (i_finite fpy) p))))))
       in
       (r = &0) =>
          (finite (fp_sign_d fpx,INT 0,\n.0)),no_excep |
          round r p traps to_near tiny acc emax emin)");;

The integer n nearest the exact value x/y is obtained using the function “real_to_int_real”. 
The function “real_to_int_real”, defined in section 5.8, takes a real number and returns the nearest 
real number with integer value. 
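The mathematical relation behind the remainder can be reproduced with exact rationals. The Python sketch below is an illustration: it relies on the fact that Python's built-in `round` applied to a `Fraction` rounds halfway cases to the even integer, which plays the role of “real_to_int_real” here. The function name `ieee_rem` is hypothetical, and the sign rule for a zero result is not modelled because a `Fraction` has no signed zero.

```python
import math
from fractions import Fraction


def ieee_rem(x, y) -> Fraction:
    """r = x - y*n, where n is the integer nearest x/y (ties to even)."""
    x, y = Fraction(x), Fraction(y)
    if y == 0:
        raise ValueError("remainder is invalid for y = 0")
    n = round(x / y)  # round() on a Fraction rounds half to even
    return x - y * n


if __name__ == "__main__":
    for x, y in [(5, 3), (5, 2), (7, 2)]:
        # math.remainder implements the same IEEE-style operation on floats.
        print(x, y, ieee_rem(x, y), math.remainder(x, y))
```

Because n is the nearest integer rather than the truncated quotient, the result can be negative even when both operands are positive.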

“fp_arith” is the executive arithmetic function which filters NaN operands, normalizes the operands,
and selects the arithmetic operation. When both operands are quiet NaNs the argument “sel” determines
which of the operands will be returned by the arithmetic operation. The arguments “p1” and “p2”
determine the operands’ precision and whether normalization is needed. “pr” is the number of significant digits
used in the rounding operation (see footnote 4). The destination precision is the largest of the two precisions, in
accordance with IEEE-854 standard requirements.


new_definition ('fp_arith',
   "fp_arith op fp1 fp2 p1 p2 pr cn sel traps mode tiny acc emax1 emax2 emin1
             emin2 =
      (\(fp1c,fp2c,p,emax,emin).
         ((is_s_NaN fp1) \/ (is_s_NaN fp2)) => (NaN (quiet,cn), invalid) |
         ((is_q_NaN fp1) /\ (is_q_NaN fp2)) => ((sel => fp1 | fp2),no_excep) |
         (is_q_NaN fp1) => (fp1,no_excep) |
         (is_q_NaN fp2) => (fp2,no_excep) |
         (op = fpadd) => fp_add fp1c fp2c p pr cn traps mode tiny acc emax emin |
         (op = fpsub) => fp_sub fp1c fp2c p pr cn traps mode tiny acc emax emin |
         (op = fpmul) => fp_mul fp1c fp2c p pr cn traps mode tiny acc emax emin |
         (op = fpdiv) => fp_div fp1c fp2c p pr cn traps mode tiny acc emax emin |
         fp_rem fp1c fp2c p pr cn traps mode tiny acc emax emin)
      ((p1 = p2) => (fp1, fp2, p1, emax1, emin1) |
       (p1 < p2) => ((p_conv fp1 p1 p2), fp2, p2, emax2, emin2) |
                    (fp1, (p_conv fp2 p2 p1), p1, emax1, emin1))");;


5.2 Square root 


The result is defined and is positive for all operands greater than or equal to zero, except that sqr of -0 is -0.

new_definition ('fp_sqr',
   "fp_sqr fp p pr cn traps mode tiny acc emax emin =
      (is_s_NaN fp) => (NaN (quiet,cn), invalid) |
      (is_q_NaN fp) => (fp,no_excep) |
      ((fp_is_neg fp) /\ ~(fp_is_zero fp)) => (NaN (quiet,cn), invalid) |
      (is_infinite fp) => (infinite 0,no_excep) |
      (fp_is_zero fp) => fp, no_excep |
      (round (sqrt (fp_value (i_finite fp) p))
             pr traps mode tiny acc emax emin)");;


5.3 Precision conversions 


Conversion between floating-point numbers of all precisions shall be possible. When converting from 
a lower to a higher precision the result will be exact. Conversion from a higher to lower precision may sig- 
nal inexact. 

new_definition ('fp_p_conv',
   "fp_p_conv fp p1 p2 cn traps mode tiny acc emax2 emin2 =
      (is_s_NaN fp) => (NaN (quiet,cn), invalid) |
      (is_q_NaN fp) => (fp,no_excep) |
      (is_infinite fp) => (fp,no_excep) |
      (p1 < p2) => ((p_conv fp p1 p2),no_excep) |
      (p1 = p2) => (fp,no_excep) |
      round (fp_value (i_finite fp) p1) p2 traps mode tiny acc emax2 emin2");;

4. In most cases rounding precision is the precision of its destination. However, in systems where the result is always
delivered to double or extended destinations, the user has the option of specifying a lower rounding precision than the destination
precision. The result will be stored with the exponent range of the higher precision.


5.4 Conversion between Floating-point and Integer 

Floating-point to integer conversion is defined in HOL by converting finite floating-point numbers to 
an integer number and converting infinities and NaNs to an unspecified integer number: 


new_definition ('fp_int_conv',
   "fp_int_conv fp p mode =
      (is_s_NaN fp) => ran_int, invalid |
      (is_q_NaN fp) => ran_int, no_excep |
      (is_infinite fp) => ran_int, no_excep |
      (fp_is_zero fp) => INT 0, no_excep |
      finite2int fp p mode, exc3 fp p mode");;


The unspecified integer number is: 

new_definition (`ran_int`,
"ran_int = @N:integer. T");;

The conversion of non-zero finite floating-point numbers (of type “:fp_num”) to integer numbers (of
type “:integer”) is defined by:

new_definition (`finite2int`,
"finite2int fp p mode =
  let r = abs(fp_value (i_finite fp) p) in
  let n =
    ((mode = to_near) =>
       ((EVEN (floor r)) =>
          floor (&(ceiling (&2 real_mul r)) / &2) |
          ceiling (&(floor (&2 real_mul r)) / &2)) |
     (mode = to_pos_inf) =>
       ((fp_is_neg fp) => floor r | ceiling r) |
     (mode = to_neg_inf) =>
       ((fp_is_neg fp) => ceiling r | floor r) |
     floor r) in
  fp_is_neg fp => neg (INT n) | INT n");;
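The round-to-nearest branch relies on a doubling trick to break ties toward the even neighbour. The following Python sketch reproduces the same arithmetic on exact rationals so that the behaviour can be checked by hand. It is an illustration under stated assumptions: the operand is a signed Fraction rather than a value of type “:fp_num”, modes are plain strings, and any mode other than the three named ones truncates toward zero.

```python
import math
from fractions import Fraction

def finite2int(value, mode):
    """Round an exact rational to an integer, following finite2int."""
    r = abs(value)
    negative = value < 0
    if mode == 'to_near':
        if math.floor(r) % 2 == 0:
            # floor(r) even: a tie must round down
            n = math.floor(Fraction(math.ceil(2 * r), 2))
        else:
            # floor(r) odd: a tie must round up
            n = math.ceil(Fraction(math.floor(2 * r), 2))
    elif mode == 'to_pos_inf':
        n = math.floor(r) if negative else math.ceil(r)
    elif mode == 'to_neg_inf':
        n = math.ceil(r) if negative else math.floor(r)
    else:
        n = math.floor(r)                    # round toward zero
    return -n if negative else n
```

For example, 5/2 and 7/2 round to 2 and 4 respectively in to_near mode, while -5/2 rounds to -2 toward positive infinity and to -3 toward negative infinity.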

The function “exc3” defines the inexact exception flag for the floating-point to integer conversion: 

new_definition (`exc3`,
"exc3 fp p mode =
  let r = abs(fp_value (i_finite fp) p) in
  let N = finite2int fp p mode in
  fp_is_neg fp =>
    ((abs r = &(SND (REP_integer N))) => no_excep | inexact) |
    ((r = &(FST (REP_integer N))) => no_excep | inexact)");;

Integer to floating-point conversion is accomplished by converting the integer number to a real number
and using the round function to obtain a floating-point number.

new_definition (`int_fp_conv`,
"int_fp_conv N p traps mode tiny acc emax emin =
  let r = (NEG N => real_neg (&(SND (REP_integer N))) |
                    &(FST (REP_integer N))) in
  FST (round r p traps mode tiny acc emax emin)");;


5.5 Conversion of Floating-point to Integral valued floating-point 

Conversion of a floating-point number to an integral-valued floating-point number is defined by converting
the floating-point number to an integer and the integer back to floating point. This conversion leaves
infinities and quiet NaNs unchanged and generates an invalid exception for signaling NaNs.


new_definition (`fp_fp_int_conv`,
"fp_fp_int_conv fp p cn traps mode tiny acc emax emin =
  (is_s_NaN fp)    => (NaN(quiet,cn)), invalid |
  (is_q_NaN fp)    => fp, no_excep |
  (is_infinite fp) => fp, no_excep |
  (fp_is_zero fp)  => fp, no_excep |
  ((exc3 fp p mode = no_excep) =>
     fp, no_excep |
     (int_fp_conv (finite2int fp p mode) p traps mode tiny acc emax emin),
     inexact)");;


When the conversion from floating-point to integer is exact (exc3 = no_excep), the floating-point number
already has an integral value and no conversion is necessary. When the conversion from floating-point
to integer is inexact, conversion takes place and the inexact exception is raised.
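The same exact/inexact distinction is exposed by Python's standard decimal module, whose to_integral_exact operation rounds to an integral value and signals inexact only when the operand was not already integral. The sketch below is an analogy and not a rendering of fp_fp_int_conv; the wrapper name round_to_integral and the string labels for the exception are assumptions of this example.

```python
import decimal

def round_to_integral(x, rounding=decimal.ROUND_HALF_EVEN):
    """Round x to an integral value; report 'inexact' or 'no_excep'."""
    ctx = decimal.Context(prec=28)
    ctx.clear_flags()
    result = x.to_integral_exact(rounding=rounding, context=ctx)
    exc = 'inexact' if ctx.flags[decimal.Inexact] else 'no_excep'
    return result, exc
```

An operand such as 7 is returned unchanged with no exception, 2.5 is rounded to 2 with inexact raised, and infinity passes through unchanged.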


5.6 Conversion between floating-point and decimal string 

Decimal strings are strings of characters representing decimal numbers or strings of characters
representing non-valued entities. A partial characterization of a decimal string is performed in HOL by
defining a new type:

define_type `decimal_string` `decimal_string = quiet_nan | signaling_nan |
    nan | unrecognizable | nzero | pzero | zero | ninf | pinf | inf |
    format real`;;

The elements “nzero”, “pzero”, “zero”, and “format real” represent values. The elements
“quiet_nan”, “signaling_nan”, “nan”, “unrecognizable”, “ninf”, “pinf”, and “inf”
represent non-valued decimal strings. Note that there exists an overlap in the representation of the value 0.
The value of a decimal string “format real” is the argument “real” to the type constructor “format”:

new_definition (`ds_value`,
"ds_value ds = @r. ds = (format r)");;


The floating-point number corresponding to each of the decimal string elements is given by the function:

new_definition (`ds_fp_conv`,
"ds_fp_conv ds p cn traps mode tiny acc emax emin =
  (ds = quiet_nan)      => (NaN(quiet,cn)), no_excep |
  (ds = signaling_nan)  => (NaN(sign,cn)), no_excep |
  (ds = nan)            => (NaN(sign,cn)), no_excep |
  (ds = unrecognizable) => (NaN(quiet,cn)), invalid |
  (ds = nzero)          => finite(1,INT 0,\n.0), no_excep |
  (ds = pzero)          => finite(0,INT 0,\n.0), no_excep |
  (ds = zero)           => finite(0,INT 0,\n.0), no_excep |
  (ds = ninf)           => infinite 1, no_excep |
  (ds = pinf)           => infinite 0, no_excep |
  (ds = inf)            => infinite 0, no_excep |
  round (ds_value ds) p traps mode tiny acc emax emin");;
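Since all but the last case of ds_fp_conv are constant, the definition amounts to a lookup table followed by a call to the rounding function. The Python sketch below illustrates that structure. The tagged tuples standing in for floating-point values, the table name SPECIAL, and the stand-in exact_round (which performs no rounding at all) are assumptions of this example, not part of the HOL theory.

```python
from fractions import Fraction

# Hypothetical encoding: ('nan', kind), ('inf', sign), ('finite', sign, magnitude).
SPECIAL = {
    "quiet_nan":      (("nan", "quiet"), "no_excep"),
    "signaling_nan":  (("nan", "signaling"), "no_excep"),
    "nan":            (("nan", "signaling"), "no_excep"),
    "unrecognizable": (("nan", "quiet"), "invalid"),
    "nzero":          (("finite", 1, Fraction(0)), "no_excep"),
    "pzero":          (("finite", 0, Fraction(0)), "no_excep"),
    "zero":           (("finite", 0, Fraction(0)), "no_excep"),
    "ninf":           (("inf", 1), "no_excep"),
    "pinf":           (("inf", 0), "no_excep"),
    "inf":            (("inf", 0), "no_excep"),
}

def ds_fp_conv(ds, round_fn):
    """Map a decimal-string element to (floating-point value, exception)."""
    if isinstance(ds, str):
        return SPECIAL[ds]
    tag, value = ds                          # the valued case: ('format', r)
    if tag != "format":
        raise ValueError("unknown decimal string element")
    return round_fn(value)

def exact_round(value):
    """Stand-in for round: keeps the value exactly and raises no exception."""
    return (("finite", 1 if value < 0 else 0, abs(value)), "no_excep")
```

Only the “unrecognizable” element raises invalid; the unsigned elements “zero” and “inf” map to the positive zero and positive infinity.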


Floating-point to decimal string conversion maps floating-point numbers to the corresponding decimal
strings. Floating-point to decimal string conversion is defined in a relational style to permit more than one
decimal string for a given floating-point number. The predicate “fp_ds_conv” takes as arguments the
floating-point number “fp”, the floating-point precision “p”, and the decimal string “ds”:

new_definition (`fp_ds_conv`,
"fp_ds_conv fp p ds =
  (is_s_NaN fp) => ((ds = (signaling_nan, invalid)) \/ (ds = (nan, invalid))) |
  (is_q_NaN fp) => ((ds = (quiet_nan, no_excep)) \/ (ds = (nan, no_excep))) |
  (fp_is_zero fp /\ fp_is_neg fp) => (ds = (nzero, no_excep)) |
  (fp_is_zero fp) => ((ds = (pzero, no_excep)) \/ (ds = (zero, no_excep))) |
  (is_infinite fp /\ fp_is_neg fp) => (ds = (ninf, no_excep)) |
  (is_infinite fp) => ((ds = (pinf, no_excep)) \/ (ds = (inf, no_excep))) |
  (ds = (format (fp_value (i_finite fp) p), no_excep))");;


5.7 Comparison 


For any two floating-point numbers, one and only one of the following relations must hold:
“less than”, “equal”, “greater than”, or “unordered”. The four relations between floating-point numbers are
defined by the type:

define_type `relations` `relations = less_than | equal | greater_than |
    unordered`;;

The comparison operation can be defined in either of two ways: 1) by returning one of the four possible
relations between the arguments; 2) by returning true or false for a given predicate. The first option is
specified by the function:

new_definition (`relation`,
"relation fp1 fp2 p1 p2 =
  (is_s_NaN fp1 \/ is_s_NaN fp2) => unordered, invalid |
  (is_q_NaN fp1 \/ is_q_NaN fp2) => unordered, no_excep |
  (is_infinite fp1 /\ is_infinite fp2 /\ (fp_sign fp1 = fp_sign fp2))
      => equal, no_excep |
  (is_infinite fp1 /\ (fp_is_neg fp1)) => less_than, no_excep |
  (is_infinite fp1 /\ (fp_is_pos fp1)) => greater_than, no_excep |
  (is_infinite fp2 /\ (fp_is_neg fp2)) => greater_than, no_excep |
  (is_infinite fp2 /\ (fp_is_pos fp2)) => less_than, no_excep |
  (fp_value (i_finite fp1) p1) real_lt (fp_value (i_finite fp2) p2)
      => less_than, no_excep |
  (fp_value (i_finite fp1) p1) = (fp_value (i_finite fp2) p2)
      => equal, no_excep |
  greater_than, no_excep");;
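The ordering of the cases matters: NaNs are handled first, then infinities, and only then are finite values compared. A minimal Python sketch of the same decision sequence follows. It assumes a hypothetical tagged-tuple encoding of operands and compares exact rational values, so the precision arguments p1 and p2 of the HOL function do not appear.

```python
from fractions import Fraction

def relation(a, b):
    """Four-way comparison of two tagged values.

    Operands are ('snan',), ('qnan',), ('inf', sign) or ('finite', Fraction),
    where sign is 0 for positive and 1 for negative.
    """
    if a[0] == 'snan' or b[0] == 'snan':
        return ('unordered', 'invalid')
    if a[0] == 'qnan' or b[0] == 'qnan':
        return ('unordered', 'no_excep')
    if a[0] == 'inf' and b[0] == 'inf' and a[1] == b[1]:
        return ('equal', 'no_excep')
    if a[0] == 'inf':
        return ('less_than' if a[1] == 1 else 'greater_than', 'no_excep')
    if b[0] == 'inf':
        return ('greater_than' if b[1] == 1 else 'less_than', 'no_excep')
    if a[1] < b[1]:
        return ('less_than', 'no_excep')
    if a[1] == b[1]:
        return ('equal', 'no_excep')
    return ('greater_than', 'no_excep')
```

Exactly one relation is returned for every pair of operands, and only a signaling NaN operand produces the invalid exception.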

If the comparison operation is defined in terms of predicates, the following HOL definitions list six
predicates that must be provided by the implementation and a seventh predicate that is desirable. The
predicates are defined in terms of the function “relation”.

new_definition (`EQ`,
"EQ fp1 fp2 p1 p2 =
  (FST (relation fp1 fp2 p1 p2) = less_than)    => F, no_excep |
  (FST (relation fp1 fp2 p1 p2) = equal)        => T, no_excep |
  (FST (relation fp1 fp2 p1 p2) = greater_than) => F, no_excep |
  F, no_excep");;

new_definition (`NE`,
"NE fp1 fp2 p1 p2 =
  (FST (relation fp1 fp2 p1 p2) = less_than)    => T, no_excep |
  (FST (relation fp1 fp2 p1 p2) = equal)        => F, no_excep |
  (FST (relation fp1 fp2 p1 p2) = greater_than) => T, no_excep |
  T, no_excep");;

new_definition (`GT`,
"GT fp1 fp2 p1 p2 =
  (FST (relation fp1 fp2 p1 p2) = less_than)    => F, no_excep |
  (FST (relation fp1 fp2 p1 p2) = equal)        => F, no_excep |
  (FST (relation fp1 fp2 p1 p2) = greater_than) => T, no_excep |
  F, invalid");;

new_definition (`GE`,
"GE fp1 fp2 p1 p2 =
  (FST (relation fp1 fp2 p1 p2) = less_than)    => F, no_excep |
  (FST (relation fp1 fp2 p1 p2) = equal)        => T, no_excep |
  (FST (relation fp1 fp2 p1 p2) = greater_than) => T, no_excep |
  F, invalid");;

new_definition (`LT`,
"LT fp1 fp2 p1 p2 =
  (FST (relation fp1 fp2 p1 p2) = less_than)    => T, no_excep |
  (FST (relation fp1 fp2 p1 p2) = equal)        => F, no_excep |
  (FST (relation fp1 fp2 p1 p2) = greater_than) => F, no_excep |
  F, invalid");;

new_definition (`LE`,
"LE fp1 fp2 p1 p2 =
  (FST (relation fp1 fp2 p1 p2) = less_than)    => T, no_excep |
  (FST (relation fp1 fp2 p1 p2) = equal)        => T, no_excep |
  (FST (relation fp1 fp2 p1 p2) = greater_than) => F, no_excep |
  F, invalid");;

new_definition (`UN`,
"UN fp1 fp2 p1 p2 =
  (FST (relation fp1 fp2 p1 p2) = less_than)    => F, no_excep |
  (FST (relation fp1 fp2 p1 p2) = equal)        => F, no_excep |
  (FST (relation fp1 fp2 p1 p2) = greater_than) => F, no_excep |
  T, no_excep");;
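The seven predicates differ only in the set of relations for which they are true and in the exception raised when the operands are unordered. That observation can be captured in a small table, sketched here in Python. The table name PREDICATES and the function predicate are assumptions of this example, and the relation is passed in as a string rather than computed from floating-point operands.

```python
# For each predicate: (relations for which it is true, exception when unordered).
PREDICATES = {
    'EQ': ({'equal'}, 'no_excep'),
    'NE': ({'less_than', 'greater_than', 'unordered'}, 'no_excep'),
    'GT': ({'greater_than'}, 'invalid'),
    'GE': ({'equal', 'greater_than'}, 'invalid'),
    'LT': ({'less_than'}, 'invalid'),
    'LE': ({'less_than', 'equal'}, 'invalid'),
    'UN': ({'unordered'}, 'no_excep'),
}

def predicate(name, rel):
    """Evaluate a named predicate on a relation; return (truth value, exception)."""
    true_for, unordered_exc = PREDICATES[name]
    exc = unordered_exc if rel == 'unordered' else 'no_excep'
    return (rel in true_for, exc)
```

As in the HOL definitions, the four ordering predicates GT, GE, LT, and LE raise invalid on unordered operands, whereas EQ, NE, and UN do not.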

5.8 Supporting functions 

This section includes some functions that are used within the definition of the IEEE-854 standard in 
the HOL system, but are more of a general nature than specific to the standard. 

The function “real_to_int_real r” delivers the integer-valued real number nearest “r”. If two such
numbers exist, “real_to_int_real” delivers the even one.

new_definition (`real_to_int_real`,
"real_to_int_real r = (
  (r real_ge &0) =>
    (&
     (((&(ceiling r) real_sub r) real_lt (r real_sub &(floor r))) => ceiling r |
      ((r real_sub &(floor r)) real_lt (&(ceiling r) real_sub r)) => floor r |
      (@n.((n = ceiling r) \/ (n = floor r)) /\ (EVEN n))))
  |
    let rn = abs r in
    (real_neg (&
     (((&(ceiling rn) real_sub rn) real_lt (rn real_sub &(floor rn))) => ceiling rn |
      ((rn real_sub &(floor rn)) real_lt (&(ceiling rn) real_sub rn)) => floor rn |
      (@n.((n = ceiling rn) \/ (n = floor rn)) /\ (EVEN n)))))
  )");;
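The three-way choice between ceiling, floor, and the even neighbour can be exercised on exact rationals. The Python sketch below is an illustration, not a translation of the HOL term: it works on Fraction values, and when the argument is already an integer (so that floor and ceiling coincide) it simply returns that integer.

```python
import math
from fractions import Fraction

def real_to_int_real(r):
    """Nearest integer to r as a rational, with ties going to the even neighbour."""
    rn = abs(r)
    lo, hi = math.floor(rn), math.ceil(rn)
    if hi - rn < rn - lo:
        n = hi                               # closer to the ceiling
    elif rn - lo < hi - rn:
        n = lo                               # closer to the floor
    else:
        n = lo if lo % 2 == 0 else hi        # tie, or rn already an integer
    return Fraction(-n if r < 0 else n)
```

On the inputs 5/2, 7/2, and -5/2 this yields 2, 4, and -2.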

Logarithm base n of x is defined in terms of the natural logarithm provided in the reals library:

new_definition (`log`,
"log n x = (ln x) real_mul (real_inv (ln (& n)))");;

The ceiling function when applied to a number “x:real” returns the least number “n:num”
greater than or equal to x. This function is only valid for non-negative values of x. When x is negative,
the ceiling of x is zero.

new_definition (`ceiling`,
"ceiling x = @n. ((& n) real_ge x) /\ (!i. ((& i) real_ge x) ==> n <= i)");;

The floor function when applied to a number “x:real” returns the greatest “n:num” less than or
equal to x. Floor is only valid for non-negative arguments. When x is negative, the floor of x is an
undefined natural number.

new_definition (`floor`,
"floor x = @n. ((& n) real_le x) /\ (!i. ((& i) real_le x) ==> i <= n)");;

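Because both functions return natural numbers, their behaviour on negative arguments differs from the usual integer floor and ceiling. The Python sketch below illustrates the difference on exact rationals. The names ceiling_nat and floor_nat are assumptions of this example, and None is used to stand for the unspecified natural number that the HOL choice operator yields when no witness exists.

```python
import math
from fractions import Fraction

def ceiling_nat(x):
    """Least natural number n with n >= x; this is 0 for every negative x."""
    return max(0, math.ceil(x))

def floor_nat(x):
    """Greatest natural number n with n <= x; unspecified (None) for negative x."""
    if x < 0:
        return None
    return math.floor(x)
```

For x = 5/2 the two functions return 3 and 2, while for x = -5/2 the ceiling is 0 and the floor is unspecified.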
The function rsign returns 1 if its argument of type “:real” is negative and 0 otherwise.

new_definition (`rsign`,
"rsign r = r real_lt &0 => 1 | 0");;